PROGRAM SPEEDUP THROUGH CONCURRENT RECORD PROCESSING


Richard  Ernest  Strebendt,  Ph.D. 

Department  of  Computer  Science 

University  of  Illinois  at  Urbana-Champaign,  1974 


Much  effort  in  the  past  has  been  devoted  to  speeding  up 
computational  programs  through  the  use  of  multiprocessing.  This  paper 
examines  the  problem  of  speeding  up  data  processing  programs  which  typ- 
ically do  not  contain  a  great  deal  of  computation. 

A  machine  organization  is  proposed  which  is  capable  of  execu- 
ting several  instruction  streams  concurrently.  Compiler  algorithms  are 
described  which  automatically  insert  the  necessary  commands  to  start 
and  stop  instruction  streams  and  to  protect  common  variables  which  must 
be  accessed  sequentially. 
1. INTRODUCTION

1.1 Approaches to Program Speedup

A  continuing  concern  [CHE71a,  BEL72,  WIT72]  in  Computer  Science 
is  the  problem  of  speeding  up  the  execution  of  programs.  In  the  early 
days  of  computers  this  was  primarily  attacked  by  speeding  up  the  cir- 
cuitry of  the  machine  itself.  Faster  devices  were  developed  as  relays 
gave  way  to  vacuum  tubes  which  gave  way  in  turn  to  semiconductors.  More 
efficient  algorithms  for  arithmetic  were  found  and  continue  to  be  inves- 
tigated. Now  that  physical  limits  are  in  sight  for  the  speed  of  devices, 
the  emphasis  [GIL58,  MUR66]  on  program  speedup  is  being  placed  on 
parallelism  in  the  execution  of  the  program.  Machines  have  been  devel- 
oped [BAR68]  to  exploit  the  parallelism  inherent  in  array  operations. 
The  parallelism  present  in  algorithms  for  arithmetic  operations  has  also 
been  utilized  in  pipelined  arithmetic  units  [SEN65,  SEN67,  HIN72, 
WAT72]. 

Some consideration has been given [ASC67, FOS71, FLY71, FLY72,
CUR73, BAE73] to the possibility of using more than one processing unit
to  execute  a  program,  with  each  processor  executing  independently  of  the 
others.  Two  problems  arise  when  this  is  done.  The  first  is  the  problem 
of  conflicts  in  accessing  data  common  to  several  processors.  For  a 
particular class of programs this problem is solved by inserting a complex
set of tests around the instructions referencing the common data
[DIJ68a, DIJ68b, COU71, EIS72, HAB72] to allow any processor to access the
common data so long as no other processor is accessing that data. More
commonly, however, the problem is avoided by allowing processors to
simultaneously  execute  only  tasks  which  are  independent.  This  leads  to 
the  second  problem,  that  of  identifying  independent  tasks.  Mechani- 
cally finding  independent  tasks  within  a  program  can  be  done  [BER66, 
RAM69,  RUS69,  TJA70],  but  for  a  large  program  this  can  be  expensive  in 
machine time. The approach suggested by several investigators [CON63,
AND65, OPL65, WIR66] is that of requiring the programmer to specify in
his  program  where  he  thinks  the  processors  should  be  started  and 
stopped.  For  an  occasional  program  this  might  be  a  workable  technique, 
but  for  a  programmer  with  a  heavy  work  load  it  would  be  too  time  con- 
suming and  error-prone  to  be  a  useful  technique. 

In  this  paper  we  also  attack  the  problem  of  program  speedup 
through  the  concurrent  operation  of  more  than  one  processor.  Our 
approach,  however,  is  different  from  that  in  previous  work  done  and 
yields  a  potentially  very  high  speedup  without  searching  for  indepen- 
dent tasks  within  a  program.  We  attain  a  program  speedup  by  executing 
the  program  concurrently  with  itself,  with  each  instruction  stream  (or 
copy  of  the  program)  processing  a  different  set  of  input  data.   No 
parallel  tasking  is  attempted  within  an  instruction  stream.  It  is 
shown  in  this  paper  that  this  method  of  achieving  concurrency  has  the 
following  advantages: 

1) It is not necessary to compare all tasks with all
others to find those which are parallel executable.

2) The instructions to start and stop processors can
be inserted very easily by the compiler into a
program written for a single processor machine.
This relieves the programmer of this burden. Also,
the locations of these instructions can change with
each compilation as the program changes. Thus the
programmer can concentrate on what he wants the
program to do and not on how the machine does it.

3) The interlocking operations can be inserted by the
compiler without the intervention of the programmer.

4) The interlocking conditions are fairly simple and
can be relegated to an inexpensive hardware unit.

The bulk of this paper assumes that only one program at a
time is in execution. The modifications needed to extend the machine
proposed in this paper to multiprogramming are discussed in section 5.10.
1.2  Characteristics  of  COBOL 

Much recent work on the speedup of programs through multiprocessing
emphasizes the speedup of arithmetic [MUR71, KRA72, KUC72b].
In  many  computational  programs  the  speedups  thus  gained  are  substan- 
tial. In  many  data  processing  programs  written  in  COBOL,  however, 
little  is  gained  in  this  way  since  there  is  little  arithmetic  in  them. 
It  might  be  asked:  why  worry  about  COBOL  programs?  The  answer  is 
quite  straightforward;  more  programs  are  written  in  COBOL  than  in  any 
other  language.  Indeed,  a  recent  survey  [PHI73]  of  language  users 
indicates  that  more  programs  are  written  in  COBOL  than  in  all  of  the 
other  languages  combined.  The  economic  benefits  resulting  from  improv- 
ing the  execution  speed  of  COBOL  programs  should  be  well  worth  the 
effort. 


The characteristics which make COBOL programs as long running
as they often are suggested our method of speeding up such programs.
A typical program, judging from the examination of a number of programs*,
involves very little processing of the data compared to, say,
a FORTRAN numerical program. Commonly, a set of input data is acquired,
particular items are selected, simple calculations (if any)
are  carried  out,  then  the  data  is  reformatted  and  written  out,  and 
another  set  of  data  is  acquired  for  processing.  While  the  amount  of 
processing  done  on  each  set  of  data  is  relatively  small,  the  number  of 
sets  of  data  processed  per  run  may  be  quite  large. 

This  profile  of  a  typical  COBOL  program  suggests  three 
things.  First,  any  arithmetic  speedup  we  can  obtain  is  useful, 
although  it  may  not  be  as  dramatic  as  that  in  a  numerical  program. 
Second,  since  much  of  the  work  in  a  COBOL  program  involves  manipulating 
data,  it  seems  desirable  to  build  these  capabilities  into  the  memory 
[STO70] where this would avoid transferring large amounts of data back
and  forth  to  special  processors.  Finally,  our  greatest  speed  improve- 
ment can  be  expected  to  come  from  overlapping  the  processing  of  the 
sets  of  input  data. 

1.3 Assumptions and Restrictions

In attacking any problem of the potential magnitude of this
one, it is necessary to make some assumptions about the environment of
the proposed solution and to set bounds on the degree to which we are
willing to modify the original COBOL programs.

* Our samples are described in Chapter 6, "Experimental Results."

An  important  consideration  is  the  hardware  available  for  im- 
plementing the  machine.  We  show  in  Chapter  7  that  the  machine  proposed 
in  this  paper  could  be  built  with  components  which  are  currently  avail- 
able or  are  within  the  capabilities  of  the  current  state-of-the-art. 
In  that  chapter  it  is  also  indicated  where  capabilities  which  are  not 
yet  available  could  be  put  to  good  use. 

Another  consideration  is  the  software  with  which  the  compiler 
is  implemented.  The  compiler  algorithms  presented  in  this  paper  are 
intended  to  demonstrate  the  feasibility  of  concurrent  record  process- 
ing. In  an  actual  implementation  of  these  algorithms  we  assume  that 
the  implementer  would  use  techniques  which  take  advantage  of  the  capa- 
bilities of  the  machine  for  which  he  was  designing  the  compiler.  Such 
capabilities  would  include  the  ability  to  execute  more  than  one  instruc- 
tion stream  at  a  time. 

For  the  purposes  of  this  thesis  we  assume  no  extensions  to 
COBOL  to  facilitate  the  solution  of  the  problem,  although  we  suggest 
some  extensions  in  Chapter  9  that  might  be  useful.  We  attack  the 
problem  of  speeding  up  COBOL  programs  which  are  presented  to  us  as 
they  are  now  written  for  single  processor  sequential  machines.  This  is 
done for several reasons. First, concocting parallel-COBOL test programs
for parallel machines could result in a very small set of test
programs which is not representative of real data processing programs.
Second, were a machine such as the one proposed in this paper actually
put into use, it would have to be able to handle the huge number of
previously existing programs in some way, or else force the users to
rewrite  all  of  their  programs.  While  many  programs  should  be  rewritten 
to  make  best  use  of  the  abilities  of  the  machine,  we  show  in  this  paper 
that  many  programs  intended  for  a  sequential  machine  can  be  made  to  run 
well  on  a  concurrent  machine  without  requiring  the  user  to  modify  the 
programs.  Third,  while  there  is  a  constant  quest  for  higher  throughput 
rates  for  business  computers,  a  language  extension  which  could  improve 
throughput  but  complicates  programming  is  likely  to  be  shunned.  Quite 
often  in  a  business  data  processing  environment  the  efficiency  of  a 
program  is  less  important  than  the  ease  with  which  a  programmer,  unfa- 
miliar with  the  program,  can  make  changes  in  it.  To  bring  about  the 
kind  of  speed  improvement  possible  in  a  concurrent  machine,  it  is  not 
necessary  to  complicate  a  program  by  requiring  that  the  programmer 
insert  additional  instructions  to  control  instruction  sequencing. 
These  additional  instructions  can  be  inserted  by  the  compiler  and  need 
not  appear  in  the  language.  Finally,  because  of  the  wide  use  of  COBOL 
and  the  normal  conservative  tendency  of  people  to  resist  change,  any 
language  radically  different  from  COBOL  would  be  slow  to  gain  accep- 
tance among  business  programmers. 

In  attempting  a  solution  to  an  interesting  problem,  it  is 
easy  to  get  carried  away  and  to  propose  grandiose  schemes  which  would 
be  costly  to  implement  and  might  have  relatively  little  applicability 
to  real  programs.  To  avoid  this  pitfall  we  limit  ourselves  to  adding 
code  to,  and  rearranging  the  code  in  the  original  COBOL  program.  We 
do,  of  course,  make  use  of  the  hardware  we  must  introduce.  We  do  not 
attempt to transform the algorithm used in the program by attempting
to discover the programmer's intentions and implementing them in a
better way than he did. Besides the obvious pitfalls inherent in
attempting to outprogram the programmer, such an approach could lead
to a very expensive system whose execution speedup was offset by a very
long compilation time. Since programs are constantly being revised,
compilation  cost  is  not  to  be  ignored.  Likewise,  we  do  not  try  to 
restructure  the  data  files  into  forms  more  amenable  to  concurrent 
processing.  Again,  the  cost  of  restructuring  each  file  on  every  run 
to  suit  the  needs  of  each  program  could  eliminate  any  benefits  derived 
from  the  resulting  faster  execution.  We  do,  in  Chapter  9,  point  out 
programming  techniques  and  file  structures  which  are  good  for  concur- 
rent processing  and  should  be  used  by  programmers  in  programming  our 
machine. 
2. DESCRIPTION OF THE METHOD

The  technique  described  in  this  thesis  achieves  program 
speedup  by  concurrently  processing  as  many  input  records  as  possible, 
while interlocking the processing to preserve any sequentiality which
is essential to the correct operation of the program. This is done by
starting the processing of a record of input data as soon as it is
known which READ statement in the program is the next to be executed
(i.e., when it is known what processing is to be done on the next
record).  At  those  points  in  the  program  at  which  a  sequential  execu- 
tion constraint  exists,  processing  is  suspended  until  the  condition 
inhibiting  further  processing  has  been  removed. 

To  indicate  how  this  technique  works,  consider  Figure  2.1. 
Figure  2.1(a)  shows  a  program  written  for  a  single  processor  machine. 
A  record  is  read  in  block  1  to  obtain  values  for  A,  B,  and  C.  A  test 
is  made  in  block  3  which  compares  A  with  its  preceding  value.  If  A 
satisfies  the  test,  X  is  computed  in  block  4  and  written  out  in 
block  5.  In  block  6  the  value  of  A  is  saved.  Figure  2.1(b)  shows  the 
same  program  after  we  have  inserted  instructions  to  overlap  processing 
of  different  input  records  and  to  provide  the  necessary  interlocking 
between  the  concurrent  processing.  Block  b  causes  the  next  record  to 
be  read  as  soon  as  the  decision  at  block  3  is  made  to  remain  in  the 
1-2-3-4-5-6  loop.  Block  d  breaks  this  loop  by  releasing  the  hardware 
used  to  process  a  record.  Block  a  tests  an  interlock  indication  to  be 
certain that the correct value of OLD-A is used in block 3. Block c
releases the interlock as soon as OLD-A has been assigned its proper
value for the next record's processing. Block e is used to guarantee
that the processing occurring beyond that block is not entered until
all of the processing of preceding records is completed.

[Figure 2.1: Concurrent Execution Example]
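
The overlap and interlocking just described can be illustrated with a small Python sketch. It is not a transcription of Figure 2.1: the test on A (here, "A differs from OLD-A"), the formula for X (here, B + C), and the placement of the save of A immediately after the test are all assumptions made for the example, as are the function names. Threads stand in for instruction streams, and a turn counter stands in for the interlock on OLD-A.

```python
import threading

def sequential(records):
    """One record at a time, in the manner of a single processor machine."""
    old_a, out = None, []
    for a, b, c in records:                # read A, B, C
        out.append(b + c if a != old_a else None)   # assumed test and formula
        old_a = a                          # save A for the next record
    return out

def concurrent(records):
    """One thread per record; OLD-A is used strictly in record order."""
    out = [None] * len(records)
    state = {"old_a": None, "turn": 0}     # turn: record allowed to use OLD-A
    cond = threading.Condition()

    def stream(i, a, b, c):
        with cond:
            cond.wait_for(lambda: state["turn"] == i)   # test the interlock
            changed = a != state["old_a"]               # compare with OLD-A
            state["old_a"] = a                          # save A (assumed placement)
            state["turn"] = i + 1                       # release the interlock
            cond.notify_all()
        if changed:
            out[i] = b + c                 # computed outside the interlock

    threads = [threading.Thread(target=stream, args=(i, *rec))
               for i, rec in enumerate(records)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                           # wait for all preceding records
    return out
```

Only the short section that touches OLD-A is serialized; the computation of X for different records is free to overlap, and both functions return the same list for the same input.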

2.1  Types  of  Hardware  Units  Needed 

To  implement  the  speedup  technique  discussed  in  this  paper 
we need the following hardware units:

1)  Multiple  processors  are  needed.  By  the  word 
"processor"  we  do  not  mean  a  complete  Central 
Processing  Unit.  A  processor,  as  referred  to  in 
this  paper,  is  either  an  Arithmetic  Unit,  a 
Conditional  Branch  Tree  (IF  Tree)  Processor,  or 
other  type  of  special  purpose  unit. 

a)  In  view  of  the  fact  that  there  may  be  many 
records  in  process  concurrently,  we  expect 
enough  demand  for  computation  to  require  a 
number  of  Arithmetic  Units,  even  if  there  is 
very  little  arithmetic  per  record. 

b)  A  Conditional  Branch  Tree  Processor  [DAV72a] 
is  a  device  which  accepts  as  input  the  results 
of  a  collection  of  comparisons  as  a  set  of 
Boolean  values,  and  returns  the  identity  of 
the  path  to  be  taken  for  subsequent  processing. 
It  has  been  shown  [DAV72b]  that  such  a  processor 
can  select  the  appropriate  exit  point  for  up  to 
an eight-level tree of IF statements in about
two major clock cycles. The formation of
IF Trees is discussed in section 3.1.

c) Other types of processors include a unit used
to  sort  files,  such  as  that  described  by 
Batcher  [BAT68],  and  a  collection  of  I/O 
processors. 

2)  To  attain  the  necessary  memory  bandwidth  to  satisfy 
the  demands  of  a  number  of  processors  for  data,  we 
need  a  number  of  memory  units.  In  addition,  more 
memory  units  are  needed  to  hold  the  program  being 
executed.  We  propose  to  separate  the  data  memory 
from  the  program  memory  so  that  there  is  no  inter- 
ference between  them.  This  also  allows  the  design 
of  the  program  memory  to  take  advantage  of  the 
fact  that  fetches  of  instructions  tend  to  be  from 
locations  relatively  close  together  in  memory 
[COF72]. The data memory can be designed to
include  the  capability  of  doing  some  types  of  pro- 
cessing, such  as  replacing  all  occurrences  of  one 
character  by  another,  without  the  need  to  send  the 
data  to  another  unit  for  processing. 

3)  To  allow  any  processor  to  fetch  data  from  any 
memory  and  to  allow  transfers  of  data  from  any 
memory  to  any  other,  we  need  some  sort  of  Routing 
Network.

4) To control instruction sequencing, an Address
Counter is needed for each record being concurrently
processed. When we reach a point, during
the  execution  of  a  program,  at  which  another 
input  record  could  begin  to  be  processed,  we 
activate  a  previously  inactive  Address  Counter. 
It  then  fetches  instructions  to  perform  the 
processing  for  the  next  input  record.  When  an 
Address  Counter  completes  the  processing  of  a 
set  of  input  data,  it  is  deactivated  and  returns 
to  a  pool  of  units  available  for  assignment  to 
subsequent  records. 

2.2  Address  Counters  and  Interlocks 

Since  the  use  of  multiple  Address  Counters  is  a  key  part  of 
this  method  of  speeding  up  a  program's  execution,  we  examine  here  the 
starting  and  stopping  of  Address  Counters  and  the  constraints  under 
which  they  must  operate. 

To  start  an  inactive  Address  Counter  into  activity,  an 
Address  Counter  which  is  already  active  executes  a  FORK  instruction. 
One  of  the  operands  of  the  FORK  instruction  is  the  program  location  at 
which  the  new  Address  Counter  is  to  begin  execution,  which  we  call  the 
initiation  point  for  that  FORK. 
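
The pool of Address Counters and the effect of a FORK can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the text above does not say what happens when a FORK finds no inactive Address Counter, so the sketch simply returns None in that case.

```python
class AddressCounterPool:
    """Toy model of a fixed set of Address Counters (names are hypothetical)."""

    def __init__(self, count):
        self.inactive = list(range(count))   # counters available for assignment
        self.active = {}                     # counter id -> current program location

    def fork(self, initiation_point):
        """Activate an inactive counter at the given program location."""
        if not self.inactive:
            return None                      # assumption: caller must retry later
        counter = self.inactive.pop()
        self.active[counter] = initiation_point
        return counter

    def release(self, counter):
        """Deactivate a counter that has finished its record; return it to the pool."""
        del self.active[counter]
        self.inactive.append(counter)
```

A counter obtained from `fork` begins at the initiation point named in the FORK; once `release` is called it becomes available for a subsequent record.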

While  active,  the  jobs  of  the  Address  Counter  are  instruc- 
tion sequencing  and  address  calculation.  An  Address  Counter  fetches 
instructions,  computes  the  effective  data  addresses,  and  passes  the 
instruction on to the rest of the machine for execution until one of
the  following  conditions  arises: 

1)  A  conditional  branch  instruction  is  encountered. 
In  this  case  the  Address  Counter  generates  the 
request  for  the  evaluation  of  the  condition,  then 
awaits  the  result.  It  resumes  execution  at  the 
address  computed  from  the  information  supplied  by 
the  IF  Tree  Processor. 

2)  A  QUIT  instruction  is  encountered. 

3)  A  HOLD  instruction  is  encountered.  Either  of  two 
things  causes  the  instruction  stream  to  be  resumed. 
If the Address Counter was the only one active, it
is signalled to resume execution at the next instruction.
If other Address Counters are still active,
this  Address  Counter  halts  as  though  it  had  encoun- 
tered a  QUIT  instruction.  The  last  active  Address 
Counter,  after  it  executes  a  QUIT,  resumes  execu- 
tion where  the  one  executing  the  HOLD  was  halted. 

4)  An  instruction  is  encountered  which  causes  a  value 
to  be  transferred  from  one  of  the  data  memory  units 
into  one  of  the  Address  Counter's  index  registers. 
Execution  resumes  at  the  next  instruction  after  the 
value  is  received. 

5) An instruction testing an interlock is encountered.
Execution resumes at the next instruction when the
Address Counter is signalled that the interlocking
condition has been removed.

An Address Counter is restricted to only two classes of addressing in
the calculation of data addresses. The first is the class of addresses
which all Address Counters can access, namely those corresponding to common
variables. The second is the class of addresses which only a single
Address Counter can access, namely those corresponding to private variables.
To accomplish this separation of data we conceptually partition memory
into one area common to all Address Counters, and a number of private
areas, each accessible to only one Address Counter. By using base
registers containing the appropriate base addresses for the partitions
allowed to particular Address Counters, and by making the layout of
each copy of the private areas the same, we can easily implement this
partitioning.
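
The base-register scheme amounts to a one-line address calculation, sketched below in Python. The function name, the "common"/"private" tags, and the numeric base addresses are hypothetical and chosen only for illustration.

```python
def effective_address(area, offset, counter, common_base, private_bases):
    """Map (area, offset) to a memory address for one Address Counter.

    area          -- "common" or "private" (hypothetical tags)
    offset        -- position of the variable within its area
    counter       -- identity of the Address Counter making the reference
    common_base   -- base address of the shared area
    private_bases -- base address of each counter's private area
    """
    base = common_base if area == "common" else private_bases[counter]
    return base + offset

# Because every private area has the same layout, the same offset names
# the same variable in each counter's own copy.
bases = {0: 1000, 1: 2000}
first = effective_address("private", 8, 0, 0, bases)
second = effective_address("private", 8, 1, 0, bases)
```

Here `first` and `second` refer to different locations, while a reference to the common area yields the same address regardless of which counter issues it.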

In order to ensure that the results of a program executed
using  our  concurrent  record  processing  technique  are  correct,  three 
types  of  interlocks  are  needed  in  a  program. 

1) Those required to ensure that instructions in the
same  instruction  stream  which  access  the  same 
variable  are  executed  in  the  correct  sequence. 

2)  Those  required  to  protect  common  variables  which 
can  be  modified  by  any  instruction  stream  at  any 
time.  It  is  necessary  in  this  case  to  make  sure 
that  only  one  instruction  stream  accesses  such  a 
variable at a time.

3) Those required to protect variables for which
sequential execution constraints exist. These
variables, such as OLD-A in Figure 2.1, must not
be accessed by an instruction stream until the
preceding instruction stream is no longer able to
access them.

The first type of interlock is obtained if we do not allow an
instruction  to  go  into  execution  until  all  of  its  operands  are  available 
for  use.  We  show  in  Chapter  5  that  this  operation,  and  the  others 
necessary  to  handle  this  interlock  problem,  can  be  handled  by  a  hard- 
ware unit  we  call  an  Instruction  Dispatch  Unit. 

For  the  second  type  of  interlock,  we  could  associate  a  bit 
with  each  such  variable  and  use  it  as  a  semaphore  [DIJ65].  It  turns 
out,  however,  that  the  Instruction  Dispatch  Unit  intended  to  handle  the 
first  type  of  interlock  problem  also  solves  the  second  type  of  inter- 
lock problem.  It  should  be  noted  that  no  work  by  the  compiler  is 
needed  for  either  of  these  interlocks. 

The  third  type  of  interlock  problem  does  require  both  com- 
piler algorithms  and  hardware  to  handle  it.  First,  the  variables  which 
require  this  type  of  interlock  must  be  identified.  These  variables  are 
those  which  must  be  accessed  by  Address  Counters  (or,  equivalently, 
instruction  streams)  in  the  order  in  which  the  Address  Counters  are 
activated.  Then  those  blocks  of  code  (nodes  in  the  program  graph) 
which  contain  references  to  these  variables  must  be  identified.  Finally 
the  compiler  must  insert  instructions  in  appropriate  places  to  test  and 
to  release  interlock  indicators  for  each  of  these  interlocked  variables. 
These indicators are included in the circuitry of the Address Counter
Coordinator.  We  could  implement  this  type  of  interlock  by  constructing 
an  unlocking  function  attached  to  the  variable,  but  this  could  lead  to 
a  problem.  The  simpler  interlocks  pose  no  problem  with  respect  to 
degrading  performance  by  tying  up  resources  while  an  instruction  waits 
for  access  to  a  variable.  The  reason  for  this  is  that  the  expected 
length  of  the  wait  should  be  relatively  short.  For  this  third  type  of 
interlock  it  is  not  sufficient  that  the  variable  is  not  being  accessed; 
it  must  no  longer  be  able  to  be  accessed  by  a  given  Address  Counter's 
predecessor  in  order  for  that  Address  Counter  to  be  able  to  access  the 
variable.  The  length  of  the  wait  for  that  condition  to  be  satisfied 
could  be  quite  long,  especially  if  there  are  many  active  Address 
Counters.  With  the  interlock  attached  to  the  variable  we  could  have 
several  statements  per  Address  Counter  which  are  half  executed  waiting 
for locked variables, with intermediate results and unfillable fetch
requests  tying  up  a  great  deal  of  hardware.  Instead,  we  do  not  allow 
a  block  to  enter  execution  until  all  of  the  interlocks  attached  to  that 
block  are  satisfied.  Equipment  is  thus  available  to  handle  the  pro- 
cessing for  the  Address  Counters  which  are  not  locked  out  of  their  data, 
and  we  avoid  a  possible  deadlock  producing  condition. 
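
The block-level rule for the third type of interlock can be sketched as follows. The representation is an assumption made for the example: each interlocked variable carries the activation number of the Address Counter that may use it next, and the class and method names are hypothetical rather than a description of the Address Counter Coordinator's circuitry.

```python
class Coordinator:
    """Toy model of per-variable interlock indicators (names are hypothetical)."""

    def __init__(self, variables):
        # turn[v]: activation number of the Address Counter allowed to use v next
        self.turn = {v: 0 for v in variables}

    def may_enter(self, activation, block_variables):
        """A block may enter execution only if all its interlocks are satisfied."""
        return all(self.turn[v] == activation for v in block_variables)

    def release(self, activation, variable):
        """Called when this counter can no longer access the variable."""
        if self.turn[variable] != activation:
            raise ValueError("release out of activation order")
        self.turn[variable] = activation + 1
```

Because the test is made before a block is dispatched, a waiting Address Counter holds no half-executed statements; it simply is not started on that block until `may_enter` becomes true.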
3. COMPILER ALGORITHMS

This  chapter  is  intended  to  demonstrate  the  feasibility  of  our 
speedup  technique  by  presenting  a  set  of  algorithms  which  could  be  used 
to  implement  this  process.  The  algorithms  are  presented  in  the  same 
order  that  they  could  appear  in  a  compiler. 

3.1  Source  Text  Scan 

The  two  sections  of  a  COBOL  program  which  provide  the  bulk  of 
the  information  in  which  we  are  interested  are  the  Data  Division  and  the 
Procedure  Division,  as  they  are  named  in  the  language  [IBM72].  The 
former  describes  the  attributes  of  the  files  used  by  the  program  and 
defines  all  of  the  variables  used  in  the  program.  The  Procedure 
Division  contains  the  executable  instructions  of  the  program. 

During  the  scanning  of  the  program  we  need  to  collect  the 
following  information  in  addition  to  that  normally  collected  by  a  com- 
piler for  a  sequential  machine.  No  great  difficulty,  however,  is 
entailed  in  accumulating  this  information  since  it  is  readily  available 
during  the  usual  scanning  process. 

1)  For  each  statement  in  the  program  we  build  two  sets 
of  variables: 

a)  The  identities  of  those  variables  fetched  for 
use  in  the  statement  comprise  the  input  set, 
or set of input variables, for the statement.

b) The identities of those variables whose values
are set by the execution of the statement
comprise the output set, or set of output variables,
for that statement.

2) A graph of the control flow of the program is built
from  the  contents  of  the  Procedure  Division,  as  is 
often  done  for  purposes  of  optimizing  code.  Each 
statement  in  the  program  is  represented  by  a  single 
node  in  the  graph  except  for  the  following  special 
cases: 

a)  Contiguous  assignment  statements,  such  as 
arithmetic  and  MOVE  statements  are  lumped 
together  in  a  single  node,  so  long  as  no  other 
type  of  statement  or  an  intervening  label  is 
encountered.  Thus,  a  single  node  in  the  program 
graph  is  generated  for  a  block  of  assignment 
statements. 

b)  PERFORM  statements  are  expanded  wherever  possi- 
ble. Where  the  PERFORM  statement  simply  calls 
for  a  single  execution  of  a  section  of  the  pro- 
gram, that  section  is  copied  into  the  program 
in  place  of  the  PERFORM  statement.  Where  there 
is  a  fixed  number  of  iterations  specified  in 
the  PERFORM  statement,  the  performed  section  of 
code  is  replicated  with  the  appropriate  value 
of  the  iteration  variable  inserted  for  each 
replication. For more complex PERFORM statements
the performed section of code is copied
in place of the PERFORM statement and imbedded
in a construct similar to the PL/1 DO block.
These blocks are handled by a compiler as in
the FORTRAN analyzer described by Kuck et al.
[KUC72b].

3) For each variable used in the program two sets have
to be constructed.

a)  The  set  of  input  references  is  the  set  of 
statements  for  which  the  variable  appears  as 
an  input  variable. 

b)  The  set  of  output  references  is  the  set  of 
statements  for  which  the  variable  appears  as 
an  output  variable. 
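
The sets collected in items 1 and 3 can be built in a single pass over the statements. The Python sketch below assumes a hypothetical pre-parsed form in which each statement is a pair (output variables, input variables); the function name is likewise an assumption.

```python
from collections import defaultdict

def build_reference_sets(statements):
    """statements: list of (outputs, inputs) pairs, one pair per statement.

    Returns the per-statement input and output sets, and the per-variable
    sets of input and output references (as statement numbers).
    """
    input_sets, output_sets = [], []
    input_refs, output_refs = defaultdict(set), defaultdict(set)
    for number, (outputs, inputs) in enumerate(statements):
        input_sets.append(set(inputs))
        output_sets.append(set(outputs))
        for variable in inputs:
            input_refs[variable].add(number)     # statement fetches the variable
        for variable in outputs:
            output_refs[variable].add(number)    # statement sets the variable
    return input_sets, output_sets, dict(input_refs), dict(output_refs)
```

The per-variable sets are simply the inverse of the per-statement sets, which is why no extra scan of the source text is needed.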

The  compiler  can  do  two  things  during  the  construction  of  the 
internal  data  base  that  can  yield  a  speedup  at  little  additional  cost. 

The  first  is  to  forward  substitute  [KUC72b]  within  a  block  of 
assignment  statements.  In  this  technique,  any  occurrence  of  an  output 
variable  as  an  input  variable  in  a  subsequent  statement  is  replaced  by 
the  expression  in  the  assignment  statement  for  that  output  variable. 
For  example,  the  block  of  assignment  statements 

    A ← B + C + D
    E ← A + F
    G ← H + E + A

would become, after forward substitution

    A ← B + C + D
    E ← B + C + D + F
    G ← H + B + C + D + F + B + C + D

In the latter case, unlike the former, there are no interdependences
between  the  statements  in  the  block,  so  that  all  three  statements  could 
be  executed  in  parallel. 
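
Forward substitution on a block like the one above can be sketched in Python as follows. The sketch is restricted to statements that are sums of variables, and it assumes that each variable is assigned at most once in the block and that no operand is assigned later in the block; the function name and data layout are hypothetical.

```python
def forward_substitute(block):
    """block: list of (target, operands), each meaning target <- sum of operands.

    Returns the same statements with every operand that was assigned earlier
    in the block replaced by the expression assigned to it.
    """
    defined = {}                 # target -> fully substituted operand list
    result = []
    for target, operands in block:
        expanded = []
        for name in operands:
            # Use the earlier expression if there is one, else the name itself.
            expanded.extend(defined.get(name, [name]))
        defined[target] = expanded
        result.append((target, expanded))
    return result

block = [("A", ["B", "C", "D"]),
         ("E", ["A", "F"]),
         ("G", ["H", "E", "A"])]
substituted = forward_substitute(block)
```

After substitution no statement in the result names A or E as an operand, so the three assignments depend only on values available on entry to the block.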

The  second  thing  the  compiler  can  do  during  this  phase  to 
improve  execution  speed  is  to  form  IF  Trees.  In  this  technique  we 
combine  individual  IF  statements  into  a  tree  structure  which  can  be 
executed  by  an  IF  Tree  Processor.  Unlike  Davis  [DAV72a,  DAV72b], 
however,  when  we  are  building  an  IF  Tree,  we  do  not  move  all  assignment 
statements  from  within  the  tree  upward  to  a  point  ahead  of  the  tree. 
Instead,  we  move  upward  all  statements  upon  which  the  execution  of  the 
conditional  branches  in  the  tree  depend.  All  other  statements  are 
moved  down  to  be  collected  at  the  exits  from  the  tree.  This  is 
illustrated  in  Figure  3.1.  Figure  3.1(a)  shows  a  collection  of 
conditional  branch  statements  with  assignment  statements  occurring 
between  them.  Figure  3.1(b)  shows  the  IF  Tree  and  associated  assign- 
ment blocks  we  form.  The  conditions  have  been  transformed  into 
assignments  of  logical  values  to  a  set  of  temporary  variables  &1,  &2, 
and  &3  which  we  call  the  conditional  result  set.  This  result  set  is 
used  by  the  IF  Tree  Processor  to  determine  which  exit  is  to  be  used. 
The  identity  of  the  exit  is  then  used  by  an  Address  Counter  to  select 
the  next  instruction  to  be  executed. 
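
The role of the conditional result set can be shown with a small Python sketch that maps a set of Boolean results to an exit. The tree shape and exit names are hypothetical and are not taken from Figure 3.1. The sketch also walks the tree one level at a time, whereas the IF Tree Processor is described as selecting the exit in hardware.

```python
def select_exit(tree, results):
    """Return the exit selected by a conditional result set.

    tree    -- either an exit name (str) or a tuple
               (condition, subtree_if_true, subtree_if_false)
    results -- dict mapping condition names such as "&1" to Boolean values
    """
    node = tree
    while isinstance(node, tuple):
        condition, if_true, if_false = node
        node = if_true if results[condition] else if_false
    return node

# Hypothetical two-level tree over the temporaries &1, &2, and &3.
tree = ("&1", ("&2", "exit-1", "exit-2"), ("&3", "exit-3", "exit-4"))
chosen = select_exit(tree, {"&1": True, "&2": False, "&3": True})
```

All conditions are evaluated before the tree is consulted, which is why only the statements the conditions depend on need to be moved ahead of the tree.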
3.2 Phase and Link Identification

A  program  typically  consists  of  a  collection  of  loops  con- 
nected by  code  which  is  not  included  in  the  loops.  There  can,  of 
course,  be  many  loops  within  the  outer  loops.  In  terms  of  the  program 
graph,  we  define  a  phase  to  be  a  maximal  strongly  connected  set  of  nodes 
[CHE71b].  That  is,  a  phase  is  defined  in  such  a  way  that  any  node  in 
the  phase  can  be  reached  from  any  node  in  the  phase  (including  itself) 
by  way  of  some  directed  path  in  the  program  graph.  Any  node  not  found 
in  a  phase  is  in  a  link.  In  terms  of  program  execution,  control  remains 
within  a  phase  until  a  link  is  entered.  Once  a  link  is  entered,  control 
never  re-enters  the  exited  phase. 

We  are  particularly  interested  in  phases  for  a  couple  of 
reasons.  Obviously,  the  address  mapping  used  for  variables  referenced 
within  a  phase  must  be  invariant  within  the  phase  to  avoid  ambiguities 
in  calculating  data  addresses.  Different  mappings  can  be  used  in  dif- 
ferent phases.  More  importantly,  we  are  concentrating  our  efforts  on 
speeding  up  the  execution  of  a  phase  rather  than  of  a  link  because  the 
link  is  executed  no  more  than  once,  while  the  number  of  times  the  code 
in  a  phase  is  executed  is  potentially  very  large.

Identification  of  phases  is  simply  the  problem  of  identifying 
maximal  strongly  connected  subgraphs  in  the  program  graph.  An  algorithm 
for  this  problem  has  been  given  by  Ramamoorthy  [RAM66],  and  a  more  effi- 
cient technique  has  been  found  by  Chappell  [CHA69]. 
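
As an illustration of phase identification, the sketch below finds the maximal strongly connected sets of a small program graph. It uses Tarjan's algorithm, which is a substitute chosen for brevity and is not the method of [RAM66] or [CHA69]. The function name `find_phases`, the dictionary representation of the graph, and the node names are all assumptions made for the example.

```python
def find_phases(graph):
    """graph: dict mapping node -> list of successor nodes.

    Returns (phases, link_nodes). A phase is a strongly connected set in
    which every node can reach every node, including itself, so a single
    node counts as a phase only if it branches to itself.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    phases = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a strongly connected component; pop it.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            if len(component) > 1 or v in graph.get(v, []):
                phases.append(component)

    nodes = set(graph) | {w for succs in graph.values() for w in succs}
    for v in sorted(nodes):
        if v not in index:
            visit(v)
    in_phase = set().union(*phases) if phases else set()
    return phases, nodes - in_phase

graph = {"start": ["read"], "read": ["test"],
         "test": ["write", "end"], "write": ["read"], "end": []}
phases, links = find_phases(graph)
print(phases, links)
```

In this small graph the loop formed by `read`, `test`, and `write` is the only phase, while `start` and `end` fall into links.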


3.3  Statement  Migration 

We found during our analyses that a program can be prevented
from being sped up as much as possible because the programmer
happened to code a crucial instruction late in the program when it
could have been placed earlier in the instruction stream. On a
sequential machine this is no problem. When such an instruction
assigns a value to an interlocked variable, however, it prevents the
associated interlock on our concurrent machine from being released as
early as it could be. This in turn unnecessarily delays the
processing of data by succeeding Address Counters.

Consider,  for  example,  Figure  3.2.  The  loops  in  Figure  3.2 
are  identical  except  for  the  location  of  the  assignment  of  SEQ  to  OSEQ. 
In  Figure  3.2(a),  if  an  Address  Counter  is  waiting  to  execute  the  condi- 
tional branch,  it  cannot  be  allowed  to  proceed  until  its  predecessor  has 
executed  the  assignment  of  DATA  to  WDATA,  written  out  WDATA,  and 
assigned  OSEQ  the  proper  value.  In  Figure  3.2(b),  the  assignments  can 
both  take  place  concurrently,  thus  requiring  a  shorter  wait  by  an 
Address  Counter  before  the  value  of  OSEQ  is  set.  Since  the  speedup 
attainable  in  a  situation  such  as  that  in  Figure  3.2(b)  is  potentially 
much  greater  than  in  one  such  as  that  in  Figure  3.2(a),  it  is  worth  our 
while  to  reorder  statements  so  that  they  are  executed  as  early  as 
possible. 

Because  we  migrate  statements,  instructions  whose  operands  are 
not  available  until  late  in  the  instruction  stream  tend  to  be  placed 
after  instructions  whose  operands  are  available  earlier.  Because  of 
this  ordering,  statements  do  not  usually  wait  for  a  long  time  in  the
Instruction  Dispatch  Unit  for  their  operands  to  become  available,  thus 
reducing  the  amount  of  queue  space  required  in  that  unit. 

When  we  migrate  statements,  we  only  move  those  statements 
which  change  the  values  of  variables.  Such  statements  as  IF  and  WRITE 
should  not  be  moved. 

The algorithm which follows is a modified version of one
reported by Foster and Riseman [FOS72a, FOS72b].

Algorithm  3.1  -  Statement  Migration 

1)  Start  at  the  head  node,  corresponding  to  the  first 
entry  point,  of  the  program  graph. 

2)  Compute the earliest possible dispatch time, $t_d$,
for each of the output variables. This is done by
computing the execution time, $t_e$, for the statement
(minimum tree height for blocks of assignment statements
in [KUC72b]) and finding the maximum of the
dispatch times for the input variable set

$$t_m = \max \{\, t_d(\text{input variables}) \,\} .$$

Then

$$t_d = t_e + t_m .$$


The  dispatch  time  for  a  variable  is,  thus,  the 
earliest  time  along  a  particular  path  in  the  pro- 
gram graph  that  the  variable  is  available  for  use 
as  an  input  variable. 

3)  If  the  node  under  consideration  is  the  destination
of  one  or  more  branch  instructions,  examine  the 
locations  of  all  branches  to  this  node.  There  are 
two  possibilities  for  each.  Either  the  branch  is 
looping  back  to  an  earlier  point  on  the  path 
reaching  it,  or  the  branch  causes  the  reconvergence 
of  paths  which  separated  at  an  earlier  point  in  the 
processing.  In  the  first  case  we  do  not  attempt 
further  migration  since  we  could  end  up  moving  this 
block  endlessly  around  the  loop  without  real  gain. 
If  all  of  the  paths  are  reconvergent,  we  attempt  to 
carry  out  the  migration  process  along  all  of  the 
paths.  If  we  are  able  to  migrate  a  statement  up 
any  of  the  paths,  then  we  move  the  statement  into 
the  other  paths  as  well.  In  Figure  3.3(b)  the 
assignment statement G ← B + D can be migrated
farther  up  the  first  two  paths,  but  it  cannot  be 
migrated  farther  up  the  third  path  because  of  the 
conflict  between  the  input  variable  set  of  the 
WRITE  statement  and  the  output  variable  set  of  the 
assignment  statement,  as  discussed  in  step  (5) 
below. 

Figure 3.3
Statement Migration Example 2

4)  If the node prior to the current node corresponds
to a conditional branch, we can migrate a statement
upward past the conditional branch only if
the same statement appears at all destinations of
the branch. In Figure 3.4(a) the assignment to
A of B + C occurs on all paths leading from the
conditional branch. Also, as discussed in
step (5), there is no conflict between the output
set of this assignment and the input set of the
conditional branch instruction. Thus, this assignment
statement can be migrated up past the conditional
branch as shown in Figure 3.4(b). In the
same example, two paths from the conditional branch
contain the assignment of -3 to E. The third
branch, however, does not contain this statement.
Until the conditional branch is executed, it is
not known which value E takes on. This prevents
us from migrating this assignment statement.

5)  For assignment statements there are two cases to
consider.  The  block  of  assignment  statements  may 
be  preceded  by  another  block  of  assignment  state- 
ments or  by  some  other  type  of  block.  If  the 
preceding  block  is  another  block  of  assignment 
statements,  the  two  blocks  should  be  concatenated, 
subject  to  the  constraints  in  step  (3).  If  the 
preceding  block  is  of  another  type,  an  assignment 
can  be  moved  ahead  of  the  predecessor  if  the 
following  are  true: 

a)  The  dispatch  time  of  the  assignment  state- 
ment is  less  than  that  of  the  predecessor. 

b)  The relation

$$(I_i \cap O_j) \cup (O_i \cap I_j) \cup (O_i \cap O_j) = \emptyset \qquad (3.1)$$

is satisfied [BER66, RUS69], where $I_i$ and $O_i$
are the input and output variable sets for the
assignment statements, $I_j$ and $O_j$ are the input
and output variable sets for the predecessor,
and $\emptyset$ denotes the empty set. With test (a)
having been performed, the test

$$O_j \cap I_i = \emptyset$$

is redundant. Relation 3.1 thus reduces to

$$O_i \cap (I_j \cup O_j) = \emptyset . \qquad (3.2)$$

6)  If  the  movable  node  under  consideration  corresponds 
to  other  than  a  block  of  assignment  statements,  it 
can  be  moved  ahead  of  its  predecessor  if  the  follow- 
ing are  satisfied: 

a)  Its  dispatch  time  is  less  than  that  of  all 
statements  in  the  previous  block. 

b)  Relation 3.2 is satisfied, where $I_j$ and $O_j$ are the
input and output variable sets for the movable node, and
$O_i$ is the output variable set for the
predecessor.

7)  If  any  assignment  statements  were  moved  in  step  (5), 
forward  substitute  them  if  possible  in  their  new 
position.  Then  continue  working  them  up  the  path 
starting  at  step  (2). 

8)  If the statement moved is not an assignment statement,
continue migration with step (3).

9)  If  no  migration  was  done  in  steps  (5)  or  (6), 
attempt  migration  starting  at  step  (2)  with  the 
next  node  which  has  not  been  examined  for  migra- 
tion on  the  path.  At  a  conditional  branch,  take 
one  of  the  paths  emanating  from  it  and  put  the 
identities  of  the  initial  nodes  of  the  rest  of 
the  paths  into  a  queue. 

10)  If there is no further node on the path which
has not been examined for migration possibilities,
take the next node from the queue built in
step (9).

11)  If  the  queue  is  empty,  the  algorithm  is  completed. 
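
The two tests applied in step (5) can be sketched as follows. This is a minimal illustration, not the compiler's implementation: the function names are hypothetical, variable sets are Python sets of names, and a variable with no recorded dispatch time is assumed to be available at time 0. The set test checks the outputs of the statement being moved against the inputs and outputs of its predecessor, which is the part of the independence relation that the dispatch-time test does not already cover.

```python
def dispatch_time(exec_time, inputs, dispatch_of):
    """t_d = t_e + t_m, where t_m is the largest dispatch time among the
    statement's input variables (0 if none is recorded)."""
    t_m = max((dispatch_of.get(v, 0) for v in inputs), default=0)
    return exec_time + t_m

def relation_3_1(stmt_in, stmt_out, pred_in, pred_out):
    """Full independence test between a statement and its predecessor."""
    return not ((stmt_in & pred_out) | (stmt_out & pred_in) | (stmt_out & pred_out))

def relation_3_2(stmt_out, pred_in, pred_out):
    """Reduced test: the statement's outputs touch nothing the predecessor uses."""
    return not (stmt_out & (pred_in | pred_out))

def may_move_ahead(stmt_td, stmt_out, pred_td, pred_in, pred_out):
    """Test (a) on dispatch times, then test (b) on the variable sets."""
    return stmt_td < pred_td and relation_3_2(stmt_out, pred_in, pred_out)

# G <- B + D cannot pass a WRITE that reads G, but can pass an IF that reads X.
td = dispatch_time(1, {"B", "D"}, {"B": 2, "D": 3})
print(td)
print(may_move_ahead(td, {"G"}, 9, {"G"}, set()))
print(may_move_ahead(td, {"G"}, 9, {"X"}, set()))
```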

3.4  Variable  Type  Identification 

We  next  separate  the  set  of  variables  referenced  in  a  phase 
into  four  classes.  Identifying  these  classes  of  variables  accomplishes 
two  things.  First,  it  identifies  those  for  which  interlock  instructions 
must  be  generated.  These  variables  are  required  to  be  accessed  by 
instruction  streams  in  the  order  in  which  the  streams  are  started  into 
execution,  that  is,  variables  for  which  sequential  execution  constraints 
exist.  Second,  we  can  identify  the  private  and  common  sets  of  variables. 
The  former  have  to  be  provided  for  each  Address  Counter,  while  the  latter 
are  shared  by  all  Address  Counters. 

The  four  classes  in  which  we  are  interested  are  the  following: 

1)  Constants.  These  are  storage  locations  whose  values
are  set  outside  of  the  phase  under  consideration. 
Constants  do  not  impose  sequential  execution  con- 
straints because  they  are  never  assigned  values 
during  the  phase.  Thus  they  may  be  accessed  by 
Address  Counters  in  any  order. 

2)  Local  variables.  During  the  execution  of  a  phase 
a  separate  copy  of  each  of  these  variables  is 
maintained  by  each  of  the  active  Address  Counters. 
Separate  copies  are  needed  since  these  variables 
include  the  ones  being  used  to  contain  the  data 
from  several  records  undergoing  processing 
concurrently.  Local  variables  do  not  impose 
sequential  execution  constraints  since  each 
Address  Counter  has  its  own  copy  of  the  Local 
variable  set  for  the  phase  and  no  Address  Counter 
can  change  the  value  of  another  Address  Counter's 
Local  variables. 

3)  Reference  Independent  variables.  The  remaining 
variables  in  the  program  are  shared  by  all  Address 
Counters  active  in  the  phase.  All  of  them  must  be 
protected  by  interlocks  to  guarantee  that  they 
have  correct  values  when  referenced.  The  Reference 
Independent  variables  have  less  stringent  require- 
ments for  their  use  than  the  Reference  Dependent 
variables  described  below.  Reference  Independent 
variables  characteristically  are  modified  during
the  phase,  but  their  values  do  not  influence  the 
choice  of  paths  through  the  program.  For  example, 
a  counter  which  is  incremented  for  each  input 
record  read  is  of  this  type.  There  is  no  sequential 
execution  constraint  generated  by  the  presence  of 
a  Reference  Independent  variable  in  a  phase  since 
the  only  place  the  value  can  be  tested  is  beyond 
the  range  of  the  phase.  Only  the  final  value  of 
the  variable  must  be  correct;  intermediate  values 
are  never  examined.  We  must,  of  course,  require 
that  only  one  Address  Counter  at  a  time  have 
access  to  the  variable,  but  this  can  be  implemented 
without  including  interlock  instructions  in  the 
program  by  using  the  Instruction  Dispatch  Unit 
described  in  section  5.4. 

4)  Reference Dependent variables. These are the variables
for which sequential execution constraints
exist.  Included  in  this  class  of  variables  are  the 
files  used  in  the  phase.  No  Address  Counter  is 
allowed  to  access  one  of  these  variables  until  the 
nearest  active  predecessor  to  that  Address  Counter 
is  no  longer  able  to  access  that  variable.  The 
following  section  of  COBOL  code  demonstrates  the 
need  for  interlocks  on  these  variables: 

    1  MOVE DATA INTO PRINT-LINE.
    2  IF LINE-COUNTER > 60 THEN
    3      WRITE PRINT-FILE FROM PAGE-HEADER
    4          AFTER POSITIONING NEW-PAGE LINES,
    5      MOVE ZERO TO LINE-COUNTER.
    6  WRITE PRINT-FILE FROM PRINT-LINE
    7      AFTER POSITIONING 1 LINES.
    8  ADD 1 TO LINE-COUNTER.

Since  the  variable  LINE-COUNTER  is  tested  in  line  2, 
we  must  force  Address  Counters  to  access  LINE-COUNTER 
in  the  order  in  which  the  Address  Counters  were  acti- 
vated, and  make  each  wait  until  its  predecessors  no 
longer  alter  the  value  of  LINE-COUNTER.  Otherwise  it 
could  not  be  guaranteed  that  each  Address  Counter 
would  follow  its  proper  path  through  this  section  of 
code. 

To accomplish variable classification, we make use of the set
of input references, I, and the set of output references, O, for each
variable.

Algorithm  3.2  -  Variable  Type  Identification 

1)  If I and O are both empty, the variable is not
referenced during the phase and can be discarded.

2)  If  the  variable  is  the  name  of  a  file,  it  is 
considered  a  Reference  Dependent  variable. 


3)  If  the  variable  appears  as  an  argument  in  a  CALL 
statement,  it  is  considered  a  Reference  Dependent 
variable. 

4)  If O is empty, then the variable is never assigned
a value during the execution of the phase. The
variable is thus a Constant during the phase.

5)  If I is empty or I = O, then the variable is never
used in a conditional branch test (i.e., I contains
no elements not in O), so that it does not determine
the flow of control in the phase. Thus the variable
is a Reference Independent variable for the phase.

6)  If,  for  any  path  through  the  phase  from  one  primary 
READ  statement  to  another  primary  READ  statement, 
the  variable  appears  as  an  input  variable  before  it 
appears  as  an  output  variable,  then  it  is  a 
Reference  Dependent  variable. 

7)  If  none  of  the  above  conditions  is  met,  the  vari- 
able is  a  Local  variable. 

8)  If  any  item  in  a  record  has  been  made  a  Reference 
Dependent  variable,  then  all  the  items  in  that 
record  must  also  be  made  Reference  Dependent  to 
insure  that  the  whole  record  is  available  with  the 
correct  values  assigned  when  it  is  to  be  written 
out.  Otherwise  part  of  the  record  could  be  lost 
when  a  predecessor  to  the  outputting  Address 
Counter  is  deactivated  and  releases  its  storage. 

Since  one  of  the  conditions  tested  in  this  algorithm  must
hold  for  each  variable,  but  no  more  than  one  condition  per  variable,  it 
is  possible  to  uniquely  determine  to  which  class  a  variable  should  be 
assigned. 
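
The ordered tests of Algorithm 3.2 can be sketched as a small classifier. The sketch makes several assumptions that are not in the text: the names `VarInfo` and `classify` are hypothetical, the I and O sets hold statement identifiers, step (5) is read as "I contains no elements not in O" (a subset test), and the path analysis of step (6) is supplied as a precomputed flag instead of being derived from the program graph. Step (8), which propagates the Reference Dependent class to the other items of a record, is left out.

```python
from dataclasses import dataclass

@dataclass
class VarInfo:
    inputs: set                    # statements that read the variable (set I)
    outputs: set                   # statements that assign the variable (set O)
    is_file: bool = False
    is_call_argument: bool = False
    # Assumed result of the step (6) path analysis between primary READs.
    read_before_write_on_some_path: bool = False

def classify(info):
    """Apply steps (1) through (7) in order; the first match decides."""
    if not info.inputs and not info.outputs:
        return "unreferenced"
    if info.is_file or info.is_call_argument:
        return "Reference Dependent"
    if not info.outputs:
        return "Constant"
    if not info.inputs or info.inputs <= info.outputs:
        return "Reference Independent"
    if info.read_before_write_on_some_path:
        return "Reference Dependent"
    return "Local"

# LINE-COUNTER in the COBOL fragment: tested in line 2, updated in lines 5 and 8.
line_counter = VarInfo(inputs={2, 8}, outputs={5, 8},
                       read_before_write_on_some_path=True)
# A record counter that is only ever incremented, in statement 12.
record_count = VarInfo(inputs={12}, outputs={12})
print(classify(line_counter), "/", classify(record_count))
```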

3.5  Storage  Assignment 

Within  each  block  of  statements  we  would  like  to  fetch  and 
store  all  variables  without  storage  access  conflicts.  We  also  would 
like,  while  avoiding  access  conflicts,  to  assign  source  and  destination 
locations  for  a  data  movement  (COBOL  MOVE  instruction)  to  the  same 
memory  unit  to  avoid  needlessly  using  the  Inter-Memory  Bus.  We  also 
would  like  to  group  elements  of  a  data  structure  which  can  be  fetched 
together  into  the  same  memory  word.  To  accomplish  these  objectives,  we 
assign  variables  to  memory  units  according  to  Algorithm  3.3.  In  this 
algorithm,  the  following  four  sets  of  variables  are  constructed  for 
each  variable  v: 

$D_v$ - The Data Division Affinity Set. This is the set of
variables which appear in the same record description
in the Data Division of a COBOL program. By assigning
v to the same word as an element of $D_v$, we can
fetch both items in the same memory cycle, and we
can also simplify the transfer of information
between the Data Memory and the I/O Processors.

$A_v$ - The Procedure Division Affinity Set. This is the
set of variables with which we would like to group v.
Variables are grouped in $A_v$ because of their
relation to v in statements in the Procedure
Division of the program.

$S_v$ - The Segregation Set. This set consists of variables
from which we must separate v in assigning
memory units to avoid access conflicts.

$Q_v$ - The Indefinite Set. This set consists of variables
which we would like to put into the same
memory word as v if possible; but, if it is not
possible, we must place them in separate memory
units.

Algorithm  3.3  -  Storage  Assignment 

1)  Passing through the Data Division of the program,
form $D_v$ for each variable v. Include in $D_v$ all
variables appearing in the same record description
as v.

2)  Passing  through  the  Procedure  Division  of  the 
program,  form  the  A,  S,  and  Q  sets  for  each 
variable. 

a)  If $u, v \in I$ (where $I$ is the set of input variables
for a block of code), put u into $S_v$.
However, if $u \in (A_v \cup D_v)$, put u into $Q_v$.
Similarly, put v into $S_u$, unless $v \in (A_u \cup D_u)$,
in which case put v into $Q_u$.

b)  If $u, v \in O$ (where $O$ is the set of output
variables for a block of code), then put u into
$S_v$ and v into $S_u$.

c)  If $u \in I_i$ and $v \in O_i$ (where $I_i$ and $O_i$ are
the input and output sets for statement i),
then put u into $A_v$. However, if $u \in S_v$, put
u into $Q_v$. Similarly, put v into $A_u$ unless
$v \in S_u$, in which case put v into $Q_u$.

3)  For each variable v examine $A_v$, $S_v$, and $Q_v$.

a)  If $\exists\, u \in Q_v$ and $u \in S_v$, then delete u from
[...]$_v$.

b)  If $\exists\, u \in Q_v$ and $u \in A_v$, then delete u from
[...]$_v$.

4)  Assign variables to memory units. We do this by
examining $S_v$, $A_v$, $D_v$, and $Q_v$ for each variable v
in turn.

a)  Arbitrarily assign some variable to memory
unit 1. A heuristic for selecting this variable
is to use the one with the largest $S$ set.

b)  For each variable $u \in S_v$ and assigned to
memory unit $m_u$, mark $m_u$ unavailable to v.

c)  If all of the memory units to which variables
have been assigned are marked unavailable to
variable v, assign v to a previously unassigned
memory unit.

d)  For each variable $u \in A_v$ and assigned to
memory $m_u$, determine whether or not $m_u$ is
available to v. Assign v to the memory, in
the set of available $m_u$ units, which has the
fewest words of the appropriate type (common
or local) assigned.

e)  If v is not assigned in step (d), then for
each variable $u \in Q_v$ assigned to memory
unit $m_u$, compute

$$L = \mathrm{length}(v) + \mathrm{length}(u') .$$

We introduce $u'$ to represent u and all other
items assigned to the same word as u. That is,
if s and t are assigned to the same word,

$$s' = t' = \{s, t\} ,$$

and

$$\mathrm{length}(s') = \mathrm{length}(t') = \mathrm{length}(s) + \mathrm{length}(t) .$$

If $L \le \mathrm{length}(\text{1 memory word})$, assign v to the
same word as u. Otherwise mark $m_u$ unavailable
to v.

f)  If v is not assigned in one of the steps above,
then for each variable $u \in D_v$ and assigned to
memory unit $m_u$, compute

$$L = \mathrm{length}(v) + \mathrm{length}(u') .$$

If $L \le \mathrm{length}(\text{1 memory word})$, assign v to the
same word as u.

g)  If  no  assignment  is  made  for  a  variable  during 
steps  (a),  (c),  (d),  (e),  or  (f),  mark  the  vari- 
able "unassigned"  and  start  processing  the  next 
variable  at  step  (b). 

h)  After  the  first  pass  through  the  variables,
try  again  to  assign  the  variables  marked 
"unassigned"  in  step  (g).  Iterate  this  step 
until  either  all  variables  are  assigned,  or 
until  no  variable  is  assigned  during  the 
iteration. 

i)  If some variables are still unassigned,
assign them to memory units in such a way as
to balance the number of words used for each
type (common and local) across the memories.

5)  Calculate the address function for each variable.
We do not present an algorithm here, but, instead,
present the following remarks which are germane to
this problem.

a)  Constants,  Reference  Independent,  and  Reference 
Dependent  variables  are  assigned  to  the  common 
area  of  each  memory  unit,  using  the  same  base 
register.  Local  variables  are  assigned  to  a 
replicated  area  using  a  different  base  register 
or  set  of  base  registers. 

b)  It  is  helpful  in  assigning  memory  locations  to 
locate  variables,  which  are  in  each  other's 
affinity  sets,  at  the  same  displacements  from 
the  start  of  a  memory  word  to  allow  data  to  be 
moved  without  the  need  for  shifting  to  align 
the  data  with  the  destination  of  the  move. 

c)  Since  links  are  transitions  from  one  phase  to
another,  we  need  instructions  in  the  links  to 
move  data  items  used  in  both  phases  from  their 
locations  in  the  storage  mapping  of  the  exited 
phase  to  their  locations  in  the  storage  mapping 
of  the  entered  phase.  Generating  these 
instructions  and  the  storage  mapping  for  the 
links  after  the  storage  allocation  for  the 
phases  has  been  done  should  not  present  any 
great  difficulties. 

While this algorithm has generated good storage allocations
for the programs we have analyzed, no claim of optimality is made for
it.
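
To make the role of the Segregation Sets concrete, the sketch below covers only the part of step (4) that keeps conflicting variables in different memory units. It is a deliberately reduced illustration: the function name `assign_units` is hypothetical, the affinity, indefinite, and word-packing rules are ignored, all variables (not just the first) are ordered by the size of their S sets, and the lowest-numbered available unit is chosen where the algorithm would consult the affinity sets.

```python
def assign_units(segregation):
    """segregation: dict mapping each variable to its S set, the variables
    it must not share a memory unit with.

    Returns a dict mapping each variable to a 1-based memory unit number.
    """
    # Heuristic borrowed from step 4(a): largest S set first (ties by name).
    order = sorted(segregation, key=lambda v: (-len(segregation[v]), v))
    unit_of = {}
    units_in_use = 0
    for v in order:
        # Units already holding a member of S_v are unavailable to v.
        unavailable = {unit_of[u] for u in segregation[v] if u in unit_of}
        free = [m for m in range(1, units_in_use + 1) if m not in unavailable]
        if free:
            unit_of[v] = free[0]        # simplification: lowest available unit
        else:
            units_in_use += 1           # open a previously unassigned unit
            unit_of[v] = units_in_use
    return unit_of

s_sets = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}
print(assign_units(s_sets))
```

In the example, A conflicts with both B and C, so two memory units are enough: A and D share one, B and C the other.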

3.6  Positioning  FORK,  HOLD,  and  QUIT  Instructions 

We  define  the  FORK,  HOLD,  and  QUIT  instructions  as  follows: 

FORK - When a FORK instruction is encountered by an
Address Counter, it causes another Address Counter
to start executing at the program address included
in the FORK instruction. We refer to this address
as the initiation point for that FORK.

QUIT - When a QUIT instruction is encountered in an
instruction stream, it results in the release of
the private storage that had been assigned to the
Address Counter executing the instruction. That
Address Counter then returns to the pool of
inactive Address Counters available for assignment
to new processing work.

HOLD - If the Address Counter executing a HOLD is the
last active Address Counter, it executes the next
instruction in its instruction stream. If it is
not the only active Address Counter, the location
of the HOLD instruction is saved and the Address
Counter is released. The last active Address
Counter then resumes processing at the instruction
following the HOLD instruction after it
executes a QUIT instruction. Our rules for
inserting FORK instructions guarantee that only
one HOLD instruction can be executed to leave a
phase. Note that the HOLD instruction is not
the same as the JOIN instruction defined by
Conway [CON63]. In Conway's machine, only the
nth processor to reach a JOIN instruction is
allowed to proceed beyond it, where n is set by
a FORK instruction. In our machine, the last
active Address Counter executes the code following
the HOLD instruction.

The idea of using FORK instructions to initiate parallel
processing is not a new one [CON63]. Usually it is proposed [OPL65,
AND65, WIR66] that the programmer insert these instructions into his
code at places he believes will yield correct parallel operation. The
type of concurrency we are attempting to exploit, however, leads to
very simple rules for inserting FORK, HOLD, and QUIT instructions so
that this can be done by the compiler. Note that we make no assumptions
when inserting these instructions about the independence of the
processing that may coincide in time during program execution.
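
The behavior of the three instructions can be modeled with a toy pool of Address Counters. Everything in the sketch is an assumption made for illustration: the class name `AddressCounterPool`, the use of integer addresses with "next instruction" taken to be `address + 1`, and the choice to return `None` when a FORK finds no idle Address Counter (that case is not covered in this section).

```python
class AddressCounterPool:
    """Toy model of the FORK, QUIT, and HOLD rules."""

    def __init__(self, size):
        self.inactive = list(range(size))   # idle Address Counter ids
        self.active = []                    # active ids, in activation order
        self.start = {}                     # id -> initiation point
        self.held = None                    # address of a pending HOLD, if any

    def fork(self, initiation_point):
        """Start an idle Address Counter at initiation_point; return its id."""
        if not self.inactive:
            return None                     # assumption: no idle counter available
        ac = self.inactive.pop(0)
        self.active.append(ac)
        self.start[ac] = initiation_point
        return ac

    def _release(self, ac):
        self.active.remove(ac)
        self.inactive.append(ac)

    def quit(self, ac):
        """Return the address where `ac` resumes, or None if it was released."""
        if self.held is not None and self.active == [ac]:
            resume_at, self.held = self.held + 1, None
            return resume_at                # last active counter continues past the HOLD
        self._release(ac)
        return None

    def hold(self, ac, address):
        """Return the next address for `ac`, or None if it was released."""
        if self.active == [ac]:
            return address + 1              # only active counter: fall through
        self.held = address                 # remember the HOLD and step aside
        self._release(ac)
        return None

pool = AddressCounterPool(2)
first = pool.fork(100)
second = pool.fork(100)
print(pool.hold(first, 40))    # first is released, HOLD location saved
print(pool.quit(second))       # second is last active, resumes after the HOLD
```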

Our  goal  in  inserting  FORK  instructions  is  to  cause  the  next 
input  record  to  enter  processing  as  early  as  possible.  Only  when  a 
path  has  been  selected  to  a  specific  READ  statement  can  the  appropriate 
FORK  be  executed.  A  FORK,  then,  is  always  located  after  a  conditional 
branch  instruction  which  selects  between  paths  leading  to  different 
READ  statements. 

In  a  program  involving  more  than  one  input  file,  we  select 
only  one  of  the  files  as  the  one  from  which  we  concurrently  process 
records.  This  file  is  referred  to  as  the  primary  input  file.  An 
initiation  point  is  associated  with  each  READ  statement  accessing  the 
primary  input  file.  READs  of  other  files  are  not  specially  handled. 
We  want,  as  the  primary  input  file,  the  file  which  in  some  way  controls 
the  processing,  such  as  a  "finder"  deck  identifying  records  to  be 
selected  from  another  file,  or  an  update  deck  which  selects  particular 
records  from  a  master  file  for  updating.  Because  this  controlling  file 
is  typically  accessed  less  often  than,  say,  a  master  file,  a  programmer 
tends  to  put  READ  statements  for  the  controlling  file  into  the  outer 
loop  of  a  phase.  Heuristically,  by  selecting  the  first  file  encoun- 
tered in  the  outermost  loop  of  a  phase  as  the  primary  file,  we  did  not 
select  the  wrong  one  in  any  of  our  sample  programs. 


* We consider the end-of-file option of a READ statement to be
a conditional branch instruction.

To  locate  the  position  at  which  we  want  to  place  the  FORK
instruction,  we  use  the  following  algorithm  for  each  primary  READ 
statement. 

Algorithm  3.4  -  FORK  Insertion 

1)  Starting  at  the  node  in  the  program  graph 
corresponding  to  the  primary  READ  statement, 
follow  one  path  within  the  phase  at  a  time 
backward.   Ignore  any  node  which  does  not 
correspond  to  a  conditional  branch. 

2)  At  a  conditional  branch, 

a)  If  the  paths  leaving  the  conditional 
branch  lead  to  more  than  one  different 
primary  READ  statement,  or  any  path  has 
not  yet  been  traced,  position  a  FORK 
instruction  on  the  path  we  are  following 
at  a  point  immediately  after  the  condi- 
tional branch,  setting  the  initiation 
point  address  to  the  appropriate  state- 
ment as  in  Algorithm  3.5. 

b)  If  all  of  the  paths  from  the  conditional 
branch  reach  the  same  primary  READ, 
follow  back  along  the  path(s)  entering 
the  conditional  branch  as  in  step  (1). 

To  locate  the  initiation  point  for  a  READ  statement,  we  must
examine  the  node  immediately  preceding  the  READ  on  each  path  to  it. 

Algorithm  3.5  -  Initiation  Point  Identification 

1)  If  the  block  preceding  the  READ  is  not  a  block  of 
assignment  statements,  then  the  initiation  point 
address  is  the  address  of  the  READ  statement. 

2)  If  the  block  preceding  the  READ  is  a  block  of 
assignment  statements,  we  have  to  split  it  into 
two  blocks  as  follows.  Since  forward  substitution 
was done as a part of the source text scan, each
of the statements in the block of assignment statements
is independent of the rest. We can thus
reorder  them  so  that  all  of  the  statements  assign- 
ing values  to  Local  variables  follow  the  statements 
having  other  types  of  variables  as  output  variables. 
This  block  is  then  split  to  put  the  assignments  to 
Local  variables  into  a  separate  block  which  follows 
the  remainder  of  the  block.  The  initiation  point 
address  is  the  address  of  this  block  of  assignments 
to  Local  variables.  This  modification  of  the  orig- 
inal block  of  assignment  statements  is  necessary  to 
insure  that  all  initialization  needed  is  done. 
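
Step (2) of Algorithm 3.5 amounts to a stable partition of the assignment block. The sketch below assumes the block has already been forward substituted, so that its statements are independent and may be reordered; the function name `split_for_initiation` and the variable names in the example are hypothetical.

```python
def split_for_initiation(block, local_vars):
    """block: list of (target, expression) assignments, assumed independent.
    local_vars: set of names classified as Local variables.

    Returns (shared_part, local_part). The initiation point is the address
    of local_part, which is placed after shared_part.
    """
    shared_part = [stmt for stmt in block if stmt[0] not in local_vars]
    local_part = [stmt for stmt in block if stmt[0] in local_vars]
    return shared_part, local_part

block = [("TOTAL", "TOTAL + AMOUNT"), ("KEY", "IN-KEY"), ("COUNT", "COUNT + 1")]
shared, local = split_for_initiation(block, {"KEY"})
print(shared)
print(local)
```

A newly started Address Counter entering at `local_part` thus initializes its own copy of the Local variables without repeating the updates to shared variables.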

Algorithm  3.6  -  QUIT  Insertion 

1)  The  QUIT  instructions  are  located  after  all  terminal 
nodes  in  the  program.  Terminal  nodes,  which  include 
such instructions as STOP and GOBACK, are nodes
whose execution on a single Address Counter machine
would cause termination of the program.

2)  QUIT instructions are positioned immediately before
all initiation points.

Algorithm  3.7  -  HOLD  Insertion 

At  the  entries  to  links,  other  than  the  link  from  the 
program  entry  point,  we  position  HOLD  instructions  to 
insure  that  all  of  the  processing  in  a  phase  is  com- 
pleted before  the  processing  in  the  link  and  the  subse- 
quent phase  is  entered. 

Figure  3.5(a)  illustrates  a  simple  program  before  FORK,  HOLD, 
and  QUIT  instructions  are  inserted.  Figure  3.5(b)  shows  the  same  pro- 
gram after  these  instructions  are  inserted. 

3.7  Inserting  Interlocks 

In  Chapter  2  it  was  pointed  out  that  there  are  three  differ- 
ent types  of  interlocking  problems.  It  was  also  noted  that  two  of 
these  problems  could  be  solved  by  the  use  of  an  Instruction  Dispatch 
Unit,  which  is  discussed  in  section  5.4.  Since  compiler  algorithms  are 
not  needed  to  handle  these  two  problems,  we  need  concern  ourselves  now 
with  only  the  third  type  of  interlock. 

We  need  to  insert  two  types  of  instructions  for  each  inter- 
locked variable,  RELEASE  and  TEST.  For  each  path  from  a  primary  READ 
to a QUIT we want to insert a TEST immediately before the first use of
the interlocked variable on that path. Similarly, for each path from a
primary  READ  to  a  QUIT  we  want  to  insert  a  RELEASE  immediately  after  the 
last  use  of  the  interlocked  variable  on  that  path. 

Algorithm 3.8 - Interlock Instruction Insertion

1)  Working  backward  from  each  QUIT,  for  each  Reference 
Dependent  variable  except  the  primary  input  file, 
examine  each  block  for  a  reference  to  the  Reference 
Dependent  variable.  Immediately  after  the  first 
such  block,  insert  a  RELEASE  for  that  variable.  At 
a  conditional  branch,  put  a  RELEASE  immediately 
after  the  exit  from  the  conditional  branch  unless 
all  exits  have  RELEASES  for  the  same  variable.  In 
this  case  delete  all  of  those  RELEASES  and  continue 
following  the  path  backward.  If  the  READ  statement 
is  reached  before  a  RELEASE  instruction  is  posi- 
tioned on  that  path,  insert  it  immediately  after 
the  READ. 

2)  Working  forward  from  each  initiation  point,  for 
each  Reference  Dependent  variable  except  the  pri- 
mary input  file,  examine  each  block  for  a  reference 
to  the  Reference  Dependent  variable.  Immediately 
ahead  of  the  first  such  block,  insert  a  TEST  for 
that  variable. 
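
The effect of the two steps on a single path can be sketched as follows. This is a simplified illustration only: it assumes one straight-line path from the READ to a QUIT, so the handling of conditional branches is omitted, and the block representation and the name `insert_interlocks` are assumptions made for the example.

```python
def insert_interlocks(path, interlocked):
    """Place one TEST and one RELEASE per interlocked variable on a path.

    path: list of (label, referenced_variables) blocks; the first block is
          assumed to be the READ and the last the QUIT.
    interlocked: set of Reference Dependent variable names.
    """
    before = {i: [] for i in range(len(path))}  # TESTs placed ahead of block i
    after = {i: [] for i in range(len(path))}   # RELEASEs placed after block i
    for var in sorted(interlocked):
        uses = [i for i, (_, refs) in enumerate(path) if var in refs]
        if uses:
            before[uses[0]].append(f"TEST {var}")      # ahead of the first use
            after[uses[-1]].append(f"RELEASE {var}")   # after the last use
        else:
            after[0].append(f"RELEASE {var}")          # unused: release after the READ
    out = []
    for i, (label, _) in enumerate(path):
        out.extend(before[i])
        out.append(label)
        out.extend(after[i])
    return out


# Hypothetical path loosely modeled on the statements named in Figure 3.6.
path = [
    ("READ FILE-I", set()),
    ("ADD A TO C", {"C"}),
    ("WRITE OUTREC FROM HEADER", {"FILE-O"}),
    ("WRITE OUTREC FROM A", {"FILE-O"}),
    ("QUIT", set()),
]
result = insert_interlocks(path, {"C", "FILE-O"})
```

On this path each variable receives exactly one TEST and one RELEASE, which is the property noted for each path of the example in Figure 3.6(b).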

As an example of this process consider Figure 3.6. We assume
in this example that the variable C and the file, FILE-O, associated
with OUTREC, are Reference Dependent variables. Figure 3.6(a) shows the
program after FORK, HOLD, and QUIT instructions have been inserted.
Figure 3.6(b) shows the same program after TEST and RELEASE instructions
have been inserted. Note in Figure 3.6(b) that only one TEST and one
RELEASE have been inserted for each variable on each path.

[Figure 3.6: Interlock Insertion Example. Panel (b) includes RELEASE C, TEST FILE-O, WRITE OUTREC FROM HEADER, WRITE OUTREC FROM A, and RELEASE FILE-O.]

4. PROOF OF THE METHOD

In  section  3.4  we  demonstrated  that  it  is  possible  to 
unambiguously  separate  the  set  of  variables  used  in  a  phase  into  four 
subsets. It was further shown that only one subset, the set of
Reference Dependent variables, causes the existence of sequential
execution constraints. In order to show that our method of executing a
program  will  yield  the  same  output  as  a  single  processor  sequential 
machine,  we  need  to  prove  the  following: 

1)  If  our  machine  and  a  single  processor  both  access 
nodes  containing  references  to  Reference  Dependent 
variables  in  the  same  order,  then  both  yield  the 
same  output. 

2)  The  interlocking  method  we  proposed  in  section  3.7 
guarantees  the  proper  sequence  of  Reference 
Dependent  variable  accessing. 

In developing these proofs we are concerned with only an
individual phase since this is where we apply our method to achieve a
speedup. 

4.1  Theorem  4.1 


Given the sets of references to Reference Dependent variables
in a phase, a single processor machine and a multiple Address Counter
machine both yield the same output if both machines execute, in the
same order, nodes containing references to Reference Dependent variables.

Consider, first, the processing of two consecutive records as
it is done by a single processor machine. The first record follows some
path through the program executing a sequence of nodes:

$$S_1 = n_{11}, n_{12}, n_{13}, \ldots, n_{1i}$$

until the flow of control in the program causes the next record to be
read. The second record then follows a path through the program
executing a sequence of nodes:

$$S_2 = n_{21}, \ldots, n_{2j}$$

Thus the processing of the two records requires the execution of a
sequence of nodes:

$$S_{12} = n_{11}, n_{12}, n_{13}, \ldots, n_{1i}, n_{21}, \ldots, n_{2j}.$$

However, we found previously that not all of the variables involved in
the execution of these nodes caused problems in concurrently processing
the data. In fact, we need consider only those nodes which contain
references to Reference Dependent variables. The sequence of nodes $S_{12}$
then reduces to the sequence:

$$S = n_1, n_2, n_3, \ldots, n_k$$

when we omit any node in $S_{12}$ that involves only Constants, Local
variables, and Reference Independent variables. As long as the nodes in
set $S$ are executed in the sequence given in $S$ the results of the
execution are the same regardless of what the method is.
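
The reduction used in this argument can be illustrated with a short Python sketch. The node names, the `references` table, and the function `reduce_to_dependent` are assumptions introduced for the example; the point is only that the reduced sequence is obtained by dropping nodes without rearranging the rest.

```python
def reduce_to_dependent(sequence, references, dependent_vars):
    """Drop every node that references no Reference Dependent variable.

    The relative order of the remaining nodes is left unchanged.
    """
    return [n for n in sequence if references[n] & dependent_vars]


# Hypothetical nodes: each maps to the set of variables it references.
references = {
    "n11": {"A"}, "n12": {"C"}, "n13": {"FILE-O"},
    "n21": {"A"}, "n22": {"C"}, "n23": {"FILE-O"},
}
s1 = ["n11", "n12", "n13"]   # nodes executed for the first record
s2 = ["n21", "n22", "n23"]   # nodes executed for the second record
s12 = s1 + s2
s = reduce_to_dependent(s12, references, {"C", "FILE-O"})
```

Here the nodes touching only `A`, assumed not to be Reference Dependent, are removed, and the surviving nodes keep the order they had in the combined sequence.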

4.2  Theorem  4.2 

Given the execution sequence $S$ from Theorem 4.1, the
interlocking method proposed in section 3.7 preserves this sequence.

For the case of the Reference Independent variables in the
program, we do not have to worry that accessing them affects the output
unless more than one Address Counter can access the variable at one
time. The Instruction Dispatch Unit discussed in section 5.4 prevents
this occurrence.

For Reference Dependent variables we must return to the
discussion of the required sequence, $S$, for the execution of nodes
containing Reference Dependent variables. Note that the sequence $S_{12}$ is
composed of two sequences:

$$S_{12} = S_1, S_2$$

where $S_1$ and $S_2$ are defined as above. Since $S$ is formed from $S_{12}$ by
dropping nodes with no rearrangement, it follows that $S$ is composed of
two subsequences:

$$S = S', S''$$

where $S' \subseteq S_1$ and $S'' \subseteq S_2$. Consider now the sequence of nodes, $\sigma_i$,
in which a variable, $v_i$, is referenced.

$$\sigma_i = \sigma_i', \sigma_i''$$
$$\sigma_i \subseteq S$$
$$\sigma_i' \subseteq S'$$
$$\sigma_i'' \subseteq S''$$

It is apparent that two different Reference Dependent variables, $v_i$ and
$v_j$, can be accessed independently of each other and their sets of nodes,
$\sigma_i$ and $\sigma_j$, can be executed concurrently unless there is a node, $n_c$,
which is common to both sequences. When node $n_c$ is encountered during
the execution of one of the sequences, the execution of that sequence
is halted until the execution of the other sequence reaches $n_c$,
whereupon both sequences are continued in execution. Thus it is
possible to rearrange sequence $S$ into a sequence having three parts. The
first part is composed of the portions of $\sigma_i$ and $\sigma_j$ preceding $n_c$,
interleaved in any manner. The second part is $n_c$. The third part is
composed of the portions of $\sigma_i$ and $\sigma_j$ which follow $n_c$ in $S$, interleaved
in some way. This argument extends in a straightforward way to any
number of variables and any collection of nodes in $S$ which are common
to more than one of the $\sigma_i$ sequences.

In  order  for  us  to  guarantee  the  correctness  of  the  results  of 
the  program  while  concurrently  executing  the  processing  of  the  records, 
we  must  obey  the  following: 

1)  We  allow  the  sequences  of  nodes  containing  references 
to  different  Reference  Dependent  variables  to  proceed 
independently  until  a  node  is  encountered  which  con- 
tains reference  to  more  than  one  such  variable.  This 
node  cannot  be  executed  until  all  sequences  in  which 
it  appears  have  reached  it. 

2) For a given variable, $v_i$, the execution of the program
must preserve the sequence $\sigma_i$. In particular,
the subsequence $\sigma_i''$ must follow the subsequence $\sigma_i'$.

The  first  condition  is  satisfied  by  the  fact  that  we  do  not 
allow  a  block  to  be  executed  until  all  of  the  interlocks  on  that  block 
have  been  satisfied. 

The second condition is met through three features of our
technique.

1) Each Address Counter executes its own subset of
nodes sequentially, taking advantage of arithmetic
parallelism where possible, just as a single
Address Counter machine would.

2) The Instruction Dispatch Unit protects the ordering
within the sequences $\sigma_i'$ and $\sigma_i''$.

3) The interlocks allow a variable to be accessed by
an Address Counter only after that Address Counter's
predecessor is finished with the variable, thus
guaranteeing that $\sigma_i''$ does not start being executed
until after $\sigma_i'$ is finished.
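
The property these conditions are meant to secure can be checked mechanically for a given interleaving. The following Python sketch is an illustration under our own assumptions (node names, the `references` table, and the function names are hypothetical): an interleaved execution order is accepted when, for every Reference Dependent variable, the nodes referencing that variable occur in the same order as in the required sequence.

```python
def sigma(sequence, references, var):
    """Subsequence of nodes in which the variable var is referenced."""
    return [n for n in sequence if var in references[n]]


def preserves_order(schedule, required, references, dependent_vars):
    """True if every per-variable subsequence of schedule matches required."""
    return all(
        sigma(schedule, references, v) == sigma(required, references, v)
        for v in dependent_vars
    )


# Hypothetical nodes for two consecutive records.
references = {"n12": {"C"}, "n13": {"FILE-O"}, "n22": {"C"}, "n23": {"FILE-O"}}
required = ["n12", "n13", "n22", "n23"]
dependent = {"C", "FILE-O"}

# The second record's use of C overlaps the first record's use of FILE-O.
ok = preserves_order(["n12", "n22", "n13", "n23"], required, references, dependent)
# The second record touches C before the first record does.
bad = preserves_order(["n22", "n12", "n13", "n23"], required, references, dependent)
```

A node that references more than one variable appears in each of the corresponding subsequences, so the check also forces it to act as a common synchronization point.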

4.3  Discussion 

Having  shown  that  the  constraints  contained  in  the  method  we 
are  proposing  are  sufficient  to  guarantee  the  correctness  of  the  results 
of  the  program,  we  now  ask  if  they  are  necessary.  There  are  two  ques- 
tions which  we  must  investigate. 

1)  If  each  Address  Counter  has  access  to  only  its  own 
set  of  Local  variables  in  addition  to  the  global  (or 
common)  variables  (a  restriction  we  examine  in 
Chapter 8), can we start Address Counters into
operation sooner than we do now?

2)  Can  we  relax  or  remove  some  of  the  interlock 
constraints? 

Considering  question  (1)  we  demonstrate  that  the  Address 
Counters cannot be put into operation any sooner than they are now.
Figure 4.1(a) shows a portion of a program graph. The rules for setting
our  FORK  instructions  cause  a  FORK  instruction  to  be  placed  at  the 
earliest  point  in  the  program  at  which  the  identity  of  the  next  primary 
READ  statement  to  be  executed  has  been  decided.  Thus  our  technique 
would  position  the  FORKs  at  the  beginning  of  path  1  and  of  path  2, 
immediately  after  the  decision  node  as  shown  in  Figure  4.1(b).  If  we 
attempt  to  bring  another  record  into  processing  at  any  point  prior  to 
the  locations  of  the  FORKs  we  clearly  get  into  trouble  since  we  do  not 
know  until  the  decision  node  has  been  executed  just  which  READ  we  exe- 
cute next  and  what  processing  ensues. 

As  to  question  (2)  concerning  removal  of  interlock  constraints, 
we  have  already  demonstrated  the  necessity  of  interlocking  Reference 
Dependent  variables.  However,  we  acknowledge  that  it  is  not  necessary 
to  protect  these  variables  by  preventing  the  whole  block  in  which  such 
a  variable  is  referenced  from  entering  execution.  This  decision, 
presented  in  section  3.7,  is  an  engineering  decision  based  on  the  belief 
that  it  is  more  important  to  prevent  a  potentially  deadlocking  condition 
than to achieve the ultimate in speedup between deadlocks.

5. MACHINE DESIGN

5.1  Over-all  Structure 

The  following  features  must  be  available  in  a  machine  designed 
for  concurrent  record  processing: 

1)  A  number  of  independent  program  counters  to  bring 
about  the  concurrent  execution  of  different 
instruction  streams. 

2)  A  number  of  arithmetic  units  capable  of  operating 
independently  of  one  another  [FLY72b].  There  also 
must  be  no  correspondence  between  instruction 
streams  and  arithmetic  units;  an  instruction  from 
any  instruction  stream  is  executed  by  any  available 
arithmetic  unit. 

3)  A  number  of  memories  which  can  each  be  addressed 
by  any  of  the  arithmetic  units. 

4)  A  device  which  prevents  instructions  which  access 
the  same  variables  from  being  executed  in  an 
incorrect  sequence. 

The  following  items  are  desirable  in  a  machine  designed  for 
concurrent  record  processing: 

1)  Program  and  data  storage  should  be  kept  separate 
to  reduce  the  problem  of  access  conflicts. 

2)  Each  program  counter  should  have  associated  with  it 
the necessary logic and registers to calculate the
effective addresses of all operands. By the
appropriate settings of the base and index registers,
each program counter could execute the same
instructions but refer to different data storage areas
when necessary.

3)  A  device  should  be  included  to  supervise  the  opera- 
tion of  all  program  counters.  Any  communications 
between  program  counters  could  travel  through  this 
device.  When  all  units  are  active  but  requests  are 
generated  for  the  activation  of  further  units,  this 
device  could  handle  the  enqueuing  of  these  requests 
until  program  counters  are  available  to  satisfy  the 
requests. 

4)  Since  much  of  the  activity  in  a  COBOL  program  is 
memory  oriented  (e.g.:  the  MOVE,  TRANSFORM,  and 
EXAMINE  verbs),  it  seems  desirable  to  build  into 
the  memory  units  some  processing  capability  to 
avoid  the  necessity  of  transferring  this  data  back 
and  forth,  to  and  from  a  processor.  Thus,  opera- 
tions which  require  only  one  operand  could  be  done 
in  the  memory  processor,  while  those  operations 
needing  more  operands,  which  would  be  found  in 
separate  memory  units,  would  be  handled  by  separate 
processing units.

The design we are proposing, shown in Figure 5.1, incorporates
the necessary and the desirable features. It is composed of the
following units:

1) Address Counters

2)  Address  Counter  Coordinator 

3)  Instruction  Dispatch  Unit 

4)  Processors 

5)  Program  Memory 

6)  Data  Memory 

7)  I/O  Processors 

8)  Routing  Network 

The design of each of these units is now discussed, but
discussion of the numbers and sizes of units recommended on the basis
of our experimental results is deferred until Chapter 7.


5.2  Program  Memory 

The  Program  Memory,  shown  in  Figure  5.2,  is  designed  as  a 
hierarchy  [KUC70,  MAT72]  of  memory  devices.  The  program  comes  ini- 
tially from  an  external  storage  medium,  such  as  disk  or  drum  storage, 
to  the  Primary  Program  Memory.  Address  Counters  obtain  instructions 
from the fast Cache memory [BAR72a, CON69, KAP73, MEA71] which holds
several  segments  of  the  program. 

The  design  of  this  memory  is  similar  to  that  of  the  IBM  360/85 
memory  [LIP68],  with  the  addition  of  the  Fetch  Queuing  and  Routing  Unit. 
This  unit  allows  any  of  the  Address  Counters  to  obtain  data  from  the 
Cache. 


[Figure 5.1: Over-all Machine Structure. Labels recoverable from the original figure: Address Counters, Program Memory, Index Bus, IF Tree Processors, Address Counter Coordinator, Processor Operation Bus.]

[Figure 5.2: Program Memory. Labels recoverable from the original figure: External Devices, Primary Program Memory, Cache, Fetch Control Unit, Fetch Queuing & Routing Unit, Address Counters 1 through n.]


5.3  Address  Counters 

An  Address  Counter,  shown  in  Figure  5.3,  operates  in  the  fol- 
lowing manner: 

1)  The  instruction  whose  address  is  in  the  Program 
Address  Register  is  fetched  from  the  Program  Memory 
and  placed  in  the  Memory  Buffer. 

2)  The  Op  Code  Decoder  examines  a  portion  of  the  opera- 
tion code  of  the  instruction  to  determine  the 
instruction type. The six instruction types
recognized are unconditional branch, conditional branch,
set internal registers, fetch index, Address Counter
control, and other.

When  an  unconditional  branch  instruction  is  encountered,  the 
effective  address  is  calculated  and  inserted  into  the  Program  Address 
Register  for  the  next  instruction  fetch. 

When  a  conditional  branch  is  found,  the  effective  address  of 
the  conditional  result  set  is  calculated  and  the  conditional  test 
instruction  is  sent  to  the  Instruction  Dispatch  Unit.  The  Address 
Counter  ID  Match  Unit  is  also  armed  to  respond  to  the  appearance  on  the 
Index  Bus  of  this  Address  Counter's  identification  number.  After  the 
IF  Tree  Processor  evaluates  the  conditional  test,  it  sends  out  on  the 
Index  Bus  the  Address  Counter  identification  number  and  the  jump  dis- 
placement from  the  current  instruction  to  the  next  instruction  to  be 
executed.  This  displacement  is  then  stored  in  an  index  register.  The 
Address Calculation Unit uses the program address and the jump
displacement to compute the address of the appropriate entry in a transfer
vector table and inserts this address into the Program Address Register
for the next instruction fetch.

When  the  operation  code  indicates  that  the  current  instruction 
loads  one  of  the  internal  registers,  either  the  operand  of  the  instruc- 
tion or  the  contents  of  the  Program  Address  Register,  as  the  instruction 
requires,  is  placed  in  the  selected  index  register. 

When  the  operation  code  indicates  that  the  current  instruction 
fetches  an  item  from  the  Data  Memory  to  an  index  register,  the  effective 
address  of  the  data  item  is  computed  and  the  instruction  is  passed  on  to 
the  Instruction  Dispatch  Unit.  The  Address  Counter  ID  Match  Unit  is 
also  armed  to  respond  to  the  appearance  on  the  Index  Bus  of  this  Address 
Counter's  identification  number.  When  the  Data  Memory  puts  this  identi- 
fication number  and  the  data  item  on  the  Index  Bus,  the  Address  Counter 
loads  the  appropriate  index  register  from  the  bus. 

An  Address  Counter  must  recognize  the  QUIT,  HOLD,  and  TEST 
instructions  and  halt  after  passing  them  on  to  the  Instruction  Dispatch 
Unit.  At  an  appropriate  time,  the  Address  Counter  is  restarted  by  the 
Address  Counter  Coordinator. 

For  any  of  the  other  instructions,  the  effective  addresses  of 
the  operands  are  calculated  and  the  instruction,  now  containing  full 
memory  addresses  for  the  operands,  is  sent  to  the  Instruction  Dispatch 
Unit. 
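
The handling of the six instruction types can be summarized as a small routing table. The Python sketch below is our own condensation of the preceding paragraphs, not a description of the hardware: the type names, the `route` function, and the treatment of control instructions other than QUIT, HOLD, and TEST as non-halting are assumptions.

```python
from enum import Enum, auto


class InstrType(Enum):
    UNCONDITIONAL_BRANCH = auto()
    CONDITIONAL_BRANCH = auto()
    SET_INTERNAL_REGISTER = auto()
    FETCH_INDEX = auto()
    ADDRESS_COUNTER_CONTROL = auto()
    OTHER = auto()


# For each type: (sent to the Instruction Dispatch Unit,
#                 ID Match Unit armed for a reply on the Index Bus)
ROUTING = {
    InstrType.UNCONDITIONAL_BRANCH: (False, False),   # handled inside the Address Counter
    InstrType.CONDITIONAL_BRANCH: (True, True),       # reply: jump displacement
    InstrType.SET_INTERNAL_REGISTER: (False, False),  # handled inside the Address Counter
    InstrType.FETCH_INDEX: (True, True),              # reply: data item for an index register
    InstrType.ADDRESS_COUNTER_CONTROL: (True, False),
    InstrType.OTHER: (True, False),
}

# Control instructions after which the Address Counter halts until restarted.
HALTING_CONTROL_OPS = {"QUIT", "HOLD", "TEST"}


def route(instr_type, mnemonic=""):
    """Summarize what an Address Counter does with a decoded instruction."""
    sent, armed = ROUTING[instr_type]
    halts = (instr_type is InstrType.ADDRESS_COUNTER_CONTROL
             and mnemonic in HALTING_CONTROL_OPS)
    return {"to_dispatch_unit": sent, "arm_id_match": armed, "halt": halts}


branch = route(InstrType.CONDITIONAL_BRANCH)
hold = route(InstrType.ADDRESS_COUNTER_CONTROL, "HOLD")
release = route(InstrType.ADDRESS_COUNTER_CONTROL, "RELEASE")
```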

5.4  Instruction  Dispatch  Unit 

Figure  5.4  shows  a  design  for  the  Instruction  Dispatch  Unit. 
The  primary  function  of  this  unit  is  to  insure  that  no  instruction  is 
allowed to go into execution until all of its operands have been set to
the proper value and are available in the Data Registers of the Data
Memory.

The technique for accomplishing this objective was inspired
by the method used to solve the same sort of problem in the IBM 360/91
[TOM67] but differs from that solution in several particulars. The
method  reported  by  Tomasulo  made  use  of  a  tag  associated  with  each 
operand.  The  tag  was  attached  to  the  register(s)  into  which  the 
operand  would  be  placed  when  it  became  available  and  represented  the 
identity  of  the  source  of  that  operand.  In  our  method  the  tag  has  no 
correlation  with  the  identity  of  the  source  of  the  operand.  Rather,  the 
tag  is  the  identity  of  the  Tag  Register  which  contains  the  Data  Memory 
address  and  status  of  that  operand.  The  tags  are  passed  around  the 
machine  to  identify  results  and  operands  when  needed,  with  the  tag 
always  eventually  returning  to  the  Instruction  Dispatch  Unit  as  an  indi- 
cation that  the  associated  data  item  is  available  for  use.  The  follow- 
ing description  of  the  operation  of  the  Instruction  Dispatch  Unit 
explains  our  technique: 

1)  An  instruction  is  accepted  from  the  Arriving 
Instruction  Queue. 

2)  If  the  instruction  is  destined  for  the  Address 
Counter  Coordinator,  it  is  immediately  dispatched 
to  that  unit. 

3)  For  other  instruction  types  (memory  and  processor 
instructions),  the  first  operand  is  sent  to  the 
Tag  Status  Register  Array.  That  unit  returns  to 
the Fetch and Tag Generator the identity of the
register assigned to the operand and an indication
of whether or not the operand was previously in a
Tag Status Register.

4)  If  the  operand  was  new  to  the  Tag  Status  Registers, 
the  Fetch  and  Tag  Generator  issues  a  fetch  request 
to  the  memory,  sending  both  the  operand  address 
and  the  register  identity,  the  tag,  sent  by  the 
Tag  Status  Registers. 

5)  Steps  (3)  and  (4)  are  repeated  for  a  second  operand 
if  it  exists. 

6)  The  address  of  the  result  of  the  operation  is  then 
sent  to  the  Tag  Status  Register  Array  and  the 
resulting  tag  is  returned. 

7)  The  instruction  with  tags  appended  is  moved  into 
an  idle  Instruction  Waiting  Register. 

8)  When  the  tags  for  all  of  the  operands  return  from 
data  memory,  the  instruction  is  transferred  either 
onto  the  Memory  Bus  or  into  a  Processing  Instruction 
Queue.  In  the  former  case  the  appropriate  memory 
accepts  the  instruction  for  processing.  In  the 
latter  case  an  idle  processor  of  the  appropriate 
type  is  selected  and  the  instruction  is  routed  to 
it. 

9)  When  the  operation  has  completed,  the  memory  sends 
the  tag  of  the  result  to  the  Tag  Queue  in  the 
Instruction Dispatch Unit.

10) When the tag is processed by the Instruction
Dispatch Controller, the corresponding Tag Status
Register is released and any instruction for
which all the other operands are also available
is started into execution.

Logic flow diagrams for the Fetch and Tag Generator, the
Instruction Dispatch Controller, and the Tag Status Register Array
appear as Figures 5.5 to 5.7.
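
The operand side of this procedure (steps 3, 4, 5, 7, and 8) can be sketched in Python as follows. The sketch is deliberately partial and rests on assumptions of our own: the class and method names are hypothetical, tags are simply small integers, and the tagging of results and the release of Tag Status Registers (steps 6, 9, and 10) are not modeled.

```python
class DispatchSketch:
    """Toy model of operand tagging in the Instruction Dispatch Unit."""

    def __init__(self):
        self.tag_of = {}      # operand address -> tag (its Tag Status Register)
        self.arrived = set()  # tags that have returned from data memory
        self.fetches = []     # (address, tag) fetch requests sent to memory
        self.waiting = []     # (instruction, operand tags still outstanding)
        self.started = []     # instructions released for execution, in order

    def accept(self, name, operand_addresses):
        pending = set()
        for address in operand_addresses:
            if address not in self.tag_of:
                # Operand is new to the tag registers: assign a tag and fetch it.
                self.tag_of[address] = len(self.tag_of)
                self.fetches.append((address, self.tag_of[address]))
            tag = self.tag_of[address]
            if tag not in self.arrived:
                pending.add(tag)
        if pending:
            self.waiting.append((name, pending))   # Instruction Waiting Register
        else:
            self.started.append(name)

    def tag_returned(self, tag):
        """A tag comes back from data memory: its data item is available."""
        self.arrived.add(tag)
        still_waiting = []
        for name, pending in self.waiting:
            pending.discard(tag)
            if pending:
                still_waiting.append((name, pending))
            else:
                self.started.append(name)
        self.waiting = still_waiting


unit = DispatchSketch()
unit.accept("ADD A TO C", ["A", "C"])   # two new operands, two fetches
unit.accept("MOVE A TO B", ["A"])       # A is already tagged, no second fetch
unit.tag_returned(0)                    # A arrives: MOVE can start
unit.tag_returned(1)                    # C arrives: ADD can start
```

The example shows the two behaviors emphasized above: an operand already known to the tag registers is not fetched again, and an instruction is released only when every one of its operand tags has returned.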

5.5  Address  Counter  Coordinator 

The  Address  Counter  Coordinator  is  one  unit  which  could  be 
implemented  with  a  large  portion  of  it  contained  in  the  operating 
system  software.  It  could  also,  on  the  other  hand,  be  implemented  com- 
pletely in  hardware.  Because  the  type  of  implementation  of  this  unit 
would  be  affected  by  many  considerations  beyond  the  scope  of  this 
paper,  no  structure  is  proposed  here.  Rather,  Figures  5.8  to  5.12  give 
the  control  sequence  for  each  of  the  five  instructions  executed  by  this 
unit.  The  functions  given  in  these  figures  would  have  to  be  executed 
regardless  of  the  software/hardware  proportion  of  the  implementation. 

In  Figure  5.8  the  control  sequence  of  the  FORK  instruction  is 
given.  In  the  event  that  all  Address  Counters  are  already  active,  the 
FORK  request  is  enqueued  until  one  does  become  available.  When  an 
Address  Counter  is  assigned  to  begin  execution  at  the  initiation  point 
specified  in  the  FORK  instruction,  several  things  must  be  done  before 
that Address Counter can begin execution.

1) Storage must be allocated [ISO71] for private
variables. The size of the allocation is fixed
for all initiation points within a phase.

2)  The  base  registers  for  the  private  and  common 
storage  areas  must  be  set  in  the  Address  Counter. 
The  base  address  for  common  storage  is  fixed 
within  a  phase,  and  the  base  address  of  the 
private  storage  area  results  from  the  storage 
allocation  process  in  step  (1). 

3)  Interlocks  must  be  initialized  so  that  all  inter- 
locks associated  with  the  phase  initially  block 
the  successor  to  this  Address  Counter. 

4)  Information  must  be  maintained  which  allows  each 
Address  Counter's  immediate  predecessor  and 
successor  to  be  quickly  identified.  This  infor- 
mation is  needed  whenever  an  Address  Counter  tests 
an  interlock,  to  determine  whether  or  not  its 
predecessor  has  released  it.  It  is  also  needed 
whenever  an  Address  Counter  releases  an  interlock, 
to  restart  the  successor  if  it  is  waiting  for  this 
interlock  to  be  released. 

The  control  sequence  for  the  HOLD  instruction  is  shown  in 
Figure  5.9.  If  the  Address  Counter  executing  the  HOLD  instruction  is 
the  only  active  Address  Counter,  it  immediately  is  restarted  on  the  next 
instruction. If other Address Counters are still active, it is
necessary that they be allowed to complete their tasks before the instructions
beyond the HOLD instruction are executed by some Address Counter. There
are several ways the Address Counter which executes a HOLD under these
conditions can be handled. One way is simply to leave it waiting at the
HOLD instruction and restart it when the last Address Counter in the
phase executes a QUIT instruction. Unfortunately, this approach could
be quite inefficient if the waiting time is very long, since it leaves
the Address Counter locked up. The approach we prefer, shown in
Figure  5.9,  assumes  that  the  waiting  time  is  relatively  long,  that  the 
Address  Counter  might  be  usefully  employed  elsewhere,  and  that  the  over- 
head in  saving  the  necessary  information  to  start  another  Address  Counter 
at  this  location  at  a  later  time  is  not  prohibitive.  In  this  approach 
to  handling  the  waiting  period,  the  Address  Counter  registers  are  stored 
in  a  memory  for  later  restoration.  The  memory  used  to  save  the  registers 
could  be  one  dedicated  to  this  purpose,  or  a  part  of  a  data  memory  used 
by  the  operating  system.  A  request  is  added  to  the  HOLD  queue.  If 
there  are  any  requests  in  the  FORK  queue,  the  Address  Counter  now  begins 
execution  at  the  first  requested  initiation  point.  Otherwise  the 
Address  Counter  can  be  added  to  the  pool  of  idle  Address  Counters. 

When  a  QUIT  instruction  is  executed,  the  sequence  in 
Figure  5.10  is  followed.  Storage  assigned  to  the  Address  Counter  is 
released.  If  there  are  any  requests  in  the  FORK  queue,  the  Address 
Counter  now  begins  execution  at  the  first  requested  initiation  point. 
If,  instead,  there  is  a  request  on  the  HOLD  queue  and  the  Address 
Counter  is  the  last  one  active,  it  resumes  execution  of  the  program  at 
the instruction following the HOLD instruction. This is done by loading
the Address Counter registers with the contents of the registers saved
when the HOLD was enqueued. If neither of these conditions holds true,
then the Address Counter is returned to the pool of idle Address
Counters.
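
The queue discipline described for FORK, HOLD, and QUIT can be sketched in Python. This is a minimal model under stated assumptions: the class `Coordinator` and its method names are ours, a saved register set is represented only by a resume address, and storage allocation, base registers, and interlock initialization are omitted.

```python
from collections import deque


class Coordinator:
    """Toy model of the FORK, HOLD, and QUIT control sequences."""

    def __init__(self, n_counters):
        self.idle = deque(range(n_counters))  # pool of idle Address Counters
        self.running = {}                     # counter id -> address it was started at
        self.fork_queue = deque()             # initiation points awaiting a counter
        self.hold_queue = deque()             # saved resume addresses of held streams

    def fork(self, initiation_point):
        if self.idle:
            self.running[self.idle.popleft()] = initiation_point
        else:
            self.fork_queue.append(initiation_point)   # all counters busy

    def _free(self, counter):
        """Give a freed counter queued FORK work or return it to the idle pool."""
        if self.fork_queue:
            self.running[counter] = self.fork_queue.popleft()
        else:
            del self.running[counter]
            self.idle.append(counter)

    def hold(self, counter, resume_address):
        if len(self.running) == 1:
            self.running[counter] = resume_address     # only active counter continues
            return
        self.hold_queue.append(resume_address)         # save state for later
        self._free(counter)

    def quit(self, counter):
        if self.fork_queue:
            self.running[counter] = self.fork_queue.popleft()
        elif self.hold_queue and len(self.running) == 1:
            # Last active counter resumes after the HOLD.
            self.running[counter] = self.hold_queue.popleft()
        else:
            del self.running[counter]
            self.idle.append(counter)


c = Coordinator(2)
c.fork("INIT-1")               # counter 0 starts
c.fork("INIT-1")               # counter 1 starts
c.fork("INIT-1")               # no counter free: request is enqueued
c.quit(0)                      # counter 0 picks up the queued FORK request
c.hold(1, "AFTER-HOLD")        # counter 1 saves its state and becomes idle
c.quit(0)                      # counter 0 is last active: resumes after the HOLD
```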

In  handling  the  interlocks  which  regulate  the  access  to 
Reference  Dependent  variables,  it  is  sufficient  to  have  an  n  x  v  bit 
array,  where  n  is  the  number  of  Address  Counters  and  v  is  the  number  of 
Reference  Dependent  variables  in  the  program  phase.  All  bits  in  the 
row  associated  with  an  Address  Counter  are  initially  set  when  that 
Address  Counter  is  started  into  operation  at  some  initiation  point. 
Whenever  a  TEST  instruction  is  encountered,  the  bit  corresponding  to  the 
combination  of  the  variable  designated  in  the  instruction  and  the  Address 
Counter's  predecessor  is  tested,  as  shown  in  Figure  5.11.  If  the  bit  is 
not  set,  execution  by  that  Address  Counter  is  resumed.  If  the  bit  is 
set,  the  Address  Counter  waits  at  that  instruction  until  the  interlock 
is  released  by  its  predecessor.  This  is  done  rather  than  storing  the 
registers  and  freeing  the  Address  Counter  for  other  work  because  the 
waiting  period  in  this  case  is  expected  to  be  relatively  short.  When  a 
RELEASE  instruction  is  encountered,  the  appropriate  interlock  bit  is 
reset  and  a  test  is  made  to  see  if  the  successor  to  that  Address  Counter 
is  waiting  for  that  interlock  to  be  released,  as  shown  in  Figure  5.12. 
If  so,  the  successor  is  reactivated.  The  Address  Counter  releasing  the 
interlock  executes  its  next  instruction  without  waiting  for  the  execu- 
tion of  the  RELEASE  instruction. 
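
A software rendering of the n x v bit array may make the TEST and RELEASE sequences concrete. The sketch below is illustrative only; the class `Interlocks`, the explicit predecessor and successor tables, and the rule that a counter with no predecessor always passes a TEST are assumptions on our part.

```python
class Interlocks:
    """Toy model of the n x v interlock bit array."""

    def __init__(self, n_counters, variables):
        self.col = {v: j for j, v in enumerate(variables)}
        self.bits = [[False] * len(variables) for _ in range(n_counters)]
        self.pred = {}     # counter -> predecessor (None for the first counter)
        self.succ = {}     # counter -> successor, once one has been started
        self.waiting = {}  # counter -> variable it is blocked on

    def start(self, counter, predecessor=None):
        """Set every bit in the counter's row when it begins at an initiation point."""
        self.bits[counter] = [True] * len(self.col)
        self.pred[counter] = predecessor
        self.succ[counter] = None
        if predecessor is not None:
            self.succ[predecessor] = counter

    def test(self, counter, var):
        """Return True if the counter may proceed, False if it must wait."""
        p = self.pred[counter]
        if p is None or not self.bits[p][self.col[var]]:
            return True
        self.waiting[counter] = var
        return False

    def release(self, counter, var):
        """Reset the bit; return the successor that is reactivated, if any."""
        self.bits[counter][self.col[var]] = False
        s = self.succ[counter]
        if s is not None and self.waiting.get(s) == var:
            del self.waiting[s]
            return s
        return None


locks = Interlocks(2, ["C", "FILE-O"])
locks.start(0)
locks.start(1, predecessor=0)
first = locks.test(0, "C")      # no predecessor: proceeds
blocked = locks.test(1, "C")    # predecessor still holds C: waits
woken = locks.release(0, "C")   # predecessor releases C: successor reactivated
retry = locks.test(1, "C")      # bit is now reset: proceeds
```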

5.6  Processors 

There are a number of processors in the machine we are
proposing in this chapter. A number of these processors are simply
arithmetic and logical units. As with past machines intended for
business data processing [ADA60], these arithmetic processors should be
designed to do decimal, rather than binary, arithmetic.

In  addition  to  the  arithmetic  processors,  several  other  types 
are  included. 

The  IF  Tree  Processor  proposed  by  Davis  [DAV72b]  accepts  as 
input  the  conditional  result  set.  Each  bit  in  the  conditional  result 
set  represents  the  result  of  evaluating  the  conditional  expression  from 
one  IF  statement.  The  processor  returns  as  output  the  identification 
of  the  branch  of  a  conditional  tree  traversed.  By  evaluating  all  of 
the  conditional  expressions  concurrently,  passing  the  results  to  an 
IF Tree Processor, then executing a small number of assignment
statements concurrently, this device allows parallel execution of conditional
branches which would otherwise degrade speedup badly [TJA70, RIS72].
Because  COBOL  programs  tend  to  have  large  and  complex  decision  trees 
compared  to  those  in  a  typical  numerical  program,  and  because  several 
data  records  are  undergoing  processing  concurrently,  several  of  these 
IF  Tree  Processors  are  needed. 
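
The input and output behavior attributed to the IF Tree Processor can be mimicked in software. The Python sketch below is not the design of [DAV72b]; the nested-tuple encoding of the decision tree, the branch labels, and the function name `if_tree` are assumptions chosen only to show how a conditional result set selects one branch.

```python
def if_tree(tree, results):
    """Select a branch from a decision tree and a conditional result set.

    tree:    a branch identifier (leaf), or a tuple
             (condition_index, subtree_if_true, subtree_if_false).
    results: sequence of booleans, one per IF statement, all evaluated
             beforehand (in the machine, concurrently).
    """
    while isinstance(tree, tuple):
        index, if_true, if_false = tree
        tree = if_true if results[index] else if_false
    return tree


# Hypothetical tree: IF c0 THEN (IF c1 THEN B1 ELSE B2) ELSE B3.
tree = (0, (1, "B1", "B2"), "B3")
taken = if_tree(tree, [True, False])
other = if_tree(tree, [False, True])
```

Since every condition is available before the tree is consulted, the selection involves no further evaluation of expressions, only the choice of a path.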

Very commonly COBOL programs sort a file on one or more keys
contained  in  each  record.  Because  of  this  use  of  the  SORT  operation  and 
the  time-consuming  nature  of  software  methods  of  sorting  large  amounts 
of  data,  there  should  be  a  sorting  network  included  as  one  of  the  pro- 
cessors. The  networks  described  by  Batcher  [BAT68]  are  good  candidates 
for  this  job. 
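
To show why such a network suits hardware, the sketch below simulates a
bitonic sorter, one of the network families described by Batcher, in
Python. It is a software model under our own assumptions (the function
name and the power-of-two length restriction are choices made for this
example), not a description of the proposed processor. Every
compare-exchange inside one (k, j) step touches a disjoint pair of
positions, so a network could perform a whole step at once.

```python
def bitonic_sort(keys):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(keys)
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between the two compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if ascending:
                        out_of_order = a[i] > a[partner]
                    else:
                        out_of_order = a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 9, 1, 4, 8, 2, 6]))
```

For n keys the network needs on the order of (log n)^2 such steps, in
contrast to the n log n comparisons of a sequential software sort.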

5.7  Data  Memory  and  Buses

Each  Data  Memory  Unit,  shown  in  Figure  5.13,  includes  a 
Primary  Memory  Module,  a  Data  Register  Array,  and  a  Function  Unit. 

Data  items  are  stored  in  the  Primary  Memory  Module  until  re- 
quested by  a  fetch  request  from  the  Instruction  Dispatch  Unit.  Fetch 
requests  are  enqueued  by  the  Control  Logic.  The  requested  data  is 
transferred  to  one  of  the  Data  Registers  before  it  is  required  by  a 
processor,  rather  than  after  as  in  the  case  of  slave  memories  [WIL65] 
or  cache  memories  [LIP68].  Associated  with  each  register  is  a  word  in 
the  Address  Memory,  an  associative  memory  which  contains  the  Primary 
Memory  address  of  the  contents  of  the  register.  This  address  is  set 
during  a  fetch  from  the  Primary  Memory  Module  or  during  the  transfer  of 
a  result  value  from  a  processor.  To  avoid  unnecessary  Primary  Memory 
fetches,  the  Address  Memory  is  searched  for  each  fetch  request  in  hopes 
of  recovering  previously  used  data.  For  each  operand  request  from  a 
processor,  it  is  searched  to  determine  which  register  contains  the  re- 
quested operand.  When  a  result  is  received  from  a  processor,  the 
Address  Memory  is  searched.  If  a  register  has  already  been  allocated 
for  this  item,  the  item  is  placed  in  that  register.  Otherwise,  the 
least-recently  used  register  is  allocated  for  this  item.  The  address 
in  the  Address  Memory  is  altered  only  when  a  new  item  is  to  be  written 
into  the  register.  Flag  bits  are  provided  for  each  register  to  indi- 
cate its  status.  The  indications  are: 

1)  Waiting for Request - Data has recently been fetched
    but has not been requested by a processor.

2)  Waiting for Store - A result has been sent by a
    processor but has not yet been stored in the
    Primary Memory.

3)  Not Waiting - Data has been fetched and has been
    sent on to a processor, or a store has been completed.
    The register is available for reassignment.
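
The behavior described above can be modeled in a few lines. The Python
sketch below is a toy model under stated assumptions: the class and method
names are ours, the Address Memory is represented by the keys of an ordered
dictionary, and we assume that a register in the Not Waiting state always
exists when a new one must be allocated.

```python
from collections import OrderedDict

WAITING_FOR_REQUEST = "waiting for request"
WAITING_FOR_STORE = "waiting for store"
NOT_WAITING = "not waiting"


class DataRegisterArray:
    """Toy model; registers are ordered from least to most recently used."""

    def __init__(self, size, primary_memory):
        self.size = size
        self.primary = primary_memory      # dict: address -> value
        self.registers = OrderedDict()     # address -> [value, flag]

    def _allocate(self, address, value, flag):
        if address not in self.registers and len(self.registers) >= self.size:
            # Reassign the least-recently used register that is Not Waiting.
            victim = next(a for a, (_, f) in self.registers.items()
                          if f == NOT_WAITING)
            del self.registers[victim]
        self.registers[address] = [value, flag]
        self.registers.move_to_end(address)

    def fetch(self, address):
        """Fetch request from the Instruction Dispatch Unit."""
        if address in self.registers:      # Address Memory hit: reuse data
            entry = self.registers[address]
            if entry[1] == NOT_WAITING:
                entry[1] = WAITING_FOR_REQUEST
            self.registers.move_to_end(address)
        else:
            self._allocate(address, self.primary[address],
                           WAITING_FOR_REQUEST)

    def operand(self, address):
        """Operand request from a processor."""
        entry = self.registers[address]
        if entry[1] == WAITING_FOR_REQUEST:
            entry[1] = NOT_WAITING
        self.registers.move_to_end(address)
        return entry[0]

    def result(self, address, value):
        """Result value arriving from a processor."""
        self._allocate(address, value, WAITING_FOR_STORE)

    def store_cycle(self):
        """On a free Primary Memory cycle, write back one pending result."""
        for address, entry in self.registers.items():
            if entry[1] == WAITING_FOR_STORE:
                self.primary[address] = entry[0]
                entry[1] = NOT_WAITING
                return address
        return None
```

A short usage example: with two registers, fetching addresses 100 and 104,
delivering the operand at 100, and then receiving a result for address 108
reassigns the register that held address 100, since it is the only one in
the Not Waiting state.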

Within  each  Data  Memory  Unit  a  set  of  special  processors  is 
provided  to  handle  those  functions  which  do  not  require  a  full  processor 
of  one  of  the  types  described  in  section  5.6.  These  memory  processors 
reduce  the  demand  on  the  Routing  Network.  Functions  which  these  memory 
processors  perform  include  the  following: 

1)  Data Transformation - In COBOL the TRANSFORM statement
    is used to change all occurrences of one set
    of characters into another set of characters
    within a data item. For example,

        TRANSFORM A FROM '$¢' TO 'DC'.

    results in the change of all occurrences of the
    $ character in data item A to the letter D and
    all occurrences of the ¢ character to the letter C.

2)  Character  Examination  -  In  COBOL  the  EXAMINE  state- 
ment is  used  to  count  the  number  of  occurrences  of 
a  character  within  a  data  item.  It  can  also  be 
used  to  transform  each  occurrence  of  that  charac- 
ter to  another  character.  COBOL  also  includes 
tests  to  determine  if  a  data  item  is  numeric  or  if 
it  is  alphabetic.

3)  Counter Incrementing - A very common statement
    in COBOL programs is

        ADD 1 TO item.

    Since we can regard incrementing a value as a
    monadic operation on that variable, there is no
    need to route the value through the Routing
    Network to a processor, perform the operation,
    and return the result through the Routing
    Network back to the same memory unit.

4)  Another common type of COBOL statement is

        MOVE SPACES TO item.

    or

        MOVE ZEROS TO item.

    where the number of spaces or zeros is determined
    by the length of the item. The operation of
    jamming one of these values into an item could be
    done by logic built into the Data Register Array
    [STO70].
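
Each of these four functions is a simple operation on a single data item,
which is what makes it a candidate for logic inside the memory unit. The
Python stand-ins below illustrate this; Python strings and integers play
the role of data items, and the function names are hypothetical. They
model only the effect of each statement, not the proposed hardware.

```python
def transform(item, from_chars, to_chars):
    """TRANSFORM: replace each character of from_chars by its partner."""
    return item.translate(str.maketrans(from_chars, to_chars))


def examine_count(item, char):
    """EXAMINE used for counting: occurrences of one character."""
    return item.count(char)


def increment(value):
    """ADD 1 TO item: a monadic operation on one value."""
    return value + 1


def jam(length, fill):
    """MOVE SPACES / MOVE ZEROS: fill the whole item with one character."""
    return fill * length


print(transform("$5.10 AND 3¢", "$¢", "DC"))
print(examine_count("BANANA", "A"), increment(41), repr(jam(4, "0")))
```

None of the four needs a second operand fetched from another memory unit,
so no traffic through the Routing Network is required.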

In  Figure  5.1  it  can  be  seen  that  communications  between
memories  occur  over  the  Inter-Memory  Bus.  This  time-shared  bus  is  ade- 
quate since  the  storage  assignment  algorithm  described  in  section  3.5 
attempts  to  keep  data  items  which  have  an  affinity  for  each  other  in 
the  same  memory  unit.  Calculations  related  to  the  necessary  bus 
bandwidth  are  given  in  Chapter  7.

5.8  I/O  Processors

In the paper up to this point we have been assuming that reading
and writing data records takes very little time, little enough
that an I/O Processor can keep up with the demands of several
instruction streams. Obviously, a very sophisticated I/O Processor
is needed.

We  are  not  attempting  to  design  such  a  processor  here  since 
such  a  design  depends  heavily  on  the  capabilities  of  the  bulk  storage 
devices  with  which  it  interfaces  and  the  technologies  available  for  its 
implementation.  There  are  some  comments  we  can  make  regarding  such  a 
design,  however,  which  derive  from  observations  of  our  example  programs. 

Since  some  COBOL  programs  operate  on  several  input  and  output 
files,  it  seems  advantageous,  in  view  of  the  throughput  rates  needed, 
to  have  several  I/O  Processors.  As  long  as  the  number  of  files  being 
accessed  is  not  greater  than  the  number  of  I/O  Processors,  each  file 
should  be  assigned  to  a  separate  I/O  Processor. 

One  way  of  achieving  a  very  fast  I/O  rate  is  to  place  all  of 
the  data  in  a  random  access  memory.  Reading  or  writing  then  amounts  to 
a  transfer  of  information  from  one  memory  to  another.  If  the  file 
sizes  are  very  large,  the  amount  of  such  buffer  memory  needed  becomes
prohibitively  expensive.  However,  it  is  apparent  that  the  larger  the 
buffer  memory  can  be  made  the  closer  we  can  approach  this  ideal.  A 
large  memory,  filled  before  program  execution  starts,  could  be  at  least 
partially  refilled  while  the  original  data  is  processed.  When  an  Address 
Counter  encounters  a  READ  or  WRITE  instruction  which  cannot  be  executed 
immediately,  because  of  the  unavailability  of  data  or  buffer  space, 
there  are  several  things  that  could  be  done.  One  is  simply  to  wait
until  it  is  possible  to  proceed.  In  view  of  the  disparity  between
machine  operation  times  and  rotating  storage  access  times,  this  approach 
could  be  very  wasteful  of  execution  resources.  A  better  way  is  to 
create  an  I/O  queue  and  an  I/O  save  area  in  the  Address  Counter 
Coordinator  similar  to  those  used  for  the  enqueued  HOLD  instructions. 
Subsequent  instructions  accessing  the  same  file  would  have  to  be  chained 
together  to  ensure  that  they  are  executed  in  the  proper  order,  and  a
mechanism  for  restarting  an  instruction  stream  when  the  data  is  avail- 
able would  also  have  to  be  implemented;  but  neither  of  these  problems 
seems  especially  difficult. 

To  keep  the  amount  of  data  transferred  between  memories  dur- 
ing an  I/O  instruction  small,  it  seems  apparent  that  the  I/O  Processor 
should  have  a  description  of  the  record  format.  This  information 
allows  the  following  economies: 

1)  Only  those  items  actually  used  from  an  input  record 
would  be  transferred  from  an  input  record  to  the 
Data  Memory  Units.  It  is  quite  common  for  only  a 
few  items,  from  a  large  set  of  items  in  a  record,  to 
be  used  during  the  execution  of  the  program. 
Fillers  and  unused  data  items  would  be  discarded. 

2)  Any  constants  appearing  in  output  records,  such  as 
page  headings,  could  be  retained  in  the  I/O 
Processor,  eliminating  the  need  to  continually 
transfer  this  invariant  information  between  data 
memory  and  the  I/O  Processor. 
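
The first economy can be illustrated with a small sketch. In the Python
fragment below the record description is a hypothetical list of
(name, start, length, used) entries; the item names and the sample record
are invented for the example. Only the items marked as used are passed on.

```python
# Hypothetical record description: (name, start, length, used_by_program)
RECORD_FORMAT = [
    ("STUDENT-ID", 0, 9, True),
    ("FILLER", 9, 3, False),
    ("NAME", 12, 20, True),
    ("ADDRESS", 32, 30, False),
]


def items_to_transfer(record, record_format):
    """Return only the items the program uses, keyed by item name."""
    return {
        name: record[start:start + length]
        for name, start, length, used in record_format
        if used
    }


record = ("123456789" + "   " + "DOE JOHN".ljust(20)
          + "1 MAIN ST".ljust(30))
selected = items_to_transfer(record, RECORD_FORMAT)
sent = sum(len(value) for value in selected.values())
print(sent, "of", len(record), "characters transferred")
```

In this made-up record fewer than half of the characters would cross to
the Data Memory Units; the saving in a real file depends entirely on how
many of its items the program references.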

5.9  Routing  Network

It  is  necessary  to  provide  some  method  [KUC72a]  of  allowing 
any  processor  to  interrogate  any  Data  Memory  Unit.  Two  methods  are  com- 
bined in  Figure  5.1.  The  first,  the  Switching  Network,  is  a  crossbar 
switch,  using  a  few  of  the  high  order  bits  of  the  operand  to  select  a 
path  through  the  network.  The  second  method  is  a  system  of  time-shared 
buses  connecting  groups  of  Data  Memory  Units  and  groups  of  processors  to 
the  Switching  Network.  The  sizes  of  the  groups  are  determined  by  the
number  of  memories  and  processors,  and  are  affected  by  tradeoffs  between 
the  size  of  the  Switching  Network,  bus  bandwidth,  and  the  bus  holding 
times.  Some  calculations  relating  to  this  problem  are  given  in 
Chapter  7. 

5.10  Modifications  for  Multiprogramming 

Thus  far  we  have  been  assuming  that  only  one  program  at  a  time 
is  in  execution.  We  now  briefly  consider  the  modifications  needed  to 
execute  more  than  one  program  concurrently. 

There  need  be  no  changes  in  the  algorithms  used  in  the  com- 
piler. All  of  the  FORKs,  HOLDs,  QUITs,  and  interlocks  operate  only 
between  different  records  being  processed  by  the  same  program.  No  vari- 
ables are  shared  between  different  programs,  although  they  may  share 
files  of  data. 

The  processors  and  memories  need  not  be  changed,  except  in 
memory  size,  since  they  are  indifferent  to  the  source  of  the  instructions 
that  pass  through  them.

The  Address  Counters  do  not  operate  differently  for  different
programs.  Each  is  independent  of  the  others  except  for  the  handling  of 
interlock  conditions.  Since  interlock  handling  is  done  by  the  Address 
Counter  Coordinator  and  not  by  individual  Address  Counters,  the  design 
of  the  Address  Counter  need  not  be  modified  to  support  multiprogramming. 

The  largest  changes  must  be  made  in  the  Instruction  Dispatch 
Unit  and  in  the  Address  Counter  Coordinator.  Since  instruction  streams 
for  different  programs  are  independent,  it  is  possible  to  route  instruc- 
tions from  different  programs  through  different  Instruction  Dispatch 
Units.  This  appears  to  be  a  good  thing  to  do  since  it  is  the  Instruction 
Dispatch  Unit  which  limits  the  speed  of  our  machine,  as  discussed  in 
section  7.1.  In  order  to  implement  this  modification,  a  switch  has  to 
be  inserted  to  allow  any  Address  Counter  to  send  instructions  to  any 
Instruction  Dispatch  Unit.  The  Address  Counter  Coordinator  would,  then, 
have to be modified to allow it to control this switch and to keep
interlocks from one program distinct from those in another program.

6.  EXPERIMENTAL  RESULTS

6.1  Introduction 

Several  analyses  were  made  of  a  sample  set  of  COBOL  programs. 
These  programs  were  obtained  from  the  Student  Data  Area,  the 
Institutional  Area,  and  the  Financial  Area  in  the  University  of  Illinois 
Office  of  Administrative  Data  Processing  at  the  Urbana  campus.  Most  of 
the  programs  were  considered  by  the  programmers  to  be  typical  of  the 
types  of  processing  done  at  that  installation,  but  some  programs  were 
selected  because  they  were  atypical  and  might  be  expected  to  tax  any 
method  chosen  for  their  execution.  While  this  sample  is  limited  to 
those  programs  available  from  a  single  university  administrative  com- 
puter center,  we  feel  that  they  are  comparable  to  programs  found  in 
various  businesses.  For  one  example,  a  program  which  generates  a 
report  from  a  student  master  file  is  similar  to  one  which  generates  a 
report  from  a  file  of  an  insurance  company's  policy  holders.  As  another 
example,  consider  the  similarity  of  a  program  which  prints  student  grade 
reports  and  one  which  prints  charge  account  bills.  Of  course,  there  are 
also  many  functions  common  to  the  business  and  academic  worlds,  such  as 
maintenance  of  inventory  records  and  payroll  records.  Thus,  while  our 
sample  is  limited,  it  does  seem  to  be  representative  of  COBOL  programs 
in  general.

6.2  Variable  Size  Counts 

One  analysis  we  made  was  of  the  frequency  of  occurrence  of 
variables  of  various  sizes  in  our  sample  of  42  programs.  Fortunately,
the  Data  Division  of  a  COBOL  program  contains  a  description  of  every
variable  used  in  the  program.  This  description  contains  a  PICTURE 
clause  which  contains  information  about  the  number  of  characters  needed 
to  hold  an  item's  value.  Also  included  in  the  Data  Division  are  a 
number  of  entries  with  the  name  FILLER.  These  are  typically  used  to 
indicate  items  in  a  file  which  are  not  accessed,  or  to  hold  spaces  or 
text  for  printout  format.  Since  we  discard  unused  data  items,  such  as 
those  in  the  former  case,  and  since  we  hold  format  information  and 
constant  printout  text  in  the  I/O  Processors,  we  discarded  all  FILLER 
items  in  this  analysis. 

A  summary  of  the  results  of  this  analysis  is  given  in 
Table  6.1.  It  should  be  noted  that  some  entries  in  this  table  have  two 
values.  In  these  cases  it  was  found  that  one  or  more  of  the  programs 
in our sample used an unusually large table (i.e., a vector or a 2- or
3-dimensional array). Had we had a very large sample, we would expect
such  single  occurrences  of  large  counts  to  be  swamped  out  by  the  rest 
of  the  mass  of  the  data.  In  our  sample  of  42  programs,  however,  such 
single  occurrences  swamped  out  the  rest  of  the  counts.  To  overcome  this 
problem  we  have  deleted  the  portion  of  the  data  attributable  to  large 
single  tables;  but  we  have  given,  as  the  parenthesized  value  in  the 
table,  the  counts  which  do  include  such  tables.  A  plot  of  the  fre- 
quency counts  in  Table  6.1  is  given  in  Figure  6.1. 
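
The kind of count summarized in Table 6.1 can be sketched as follows. The
Python fragment is a simplified illustration, not the procedure actually
used for the analysis: it reads a PICTURE string, expands repetition
factors, and treats S, V, and P as occupying no character position. USAGE
clauses, separate signs, and OCCURS tables are not modeled, and the sample
programs are invented.

```python
import re
from statistics import median


def picture_size(picture):
    """Characters needed for a simplified PICTURE string like 'S9(5)V99'."""
    # Expand repetition factors: 9(5) -> 99999
    expanded = re.sub(r"(.)\((\d+)\)",
                      lambda m: m.group(1) * int(m.group(2)),
                      picture.upper())
    # S (sign), V (implied decimal point) and P (scaling) take no position
    # in this simplified model.
    return sum(1 for ch in expanded if ch not in "SVP")


def size_statistics(programs, size):
    """Occurrences, mean per program, and median per program for one size."""
    counts = [sum(1 for pic in pics if picture_size(pic) == size)
              for pics in programs]
    total = sum(counts)
    return total, total / len(programs), median(counts)


# Three made-up programs, each given as a list of PICTURE strings.
programs = [["X(5)", "9(5)", "X"], ["S9(3)V99"], ["X(10)"]]
print(size_statistics(programs, 5))
```

The mean and the median are both reported because, as discussed above, a
single program with a very large table can dominate the mean.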

Table 6.1
Frequency Count of Variable Sizes

 Size      Number of       Mean per   Median per   Percent of
(Char)    Occurrences      Program     Program       Sample

   1      4711 (24912)      112.0        33          25.29
   2      2461               58.7        18          13.21
   3      2217 ( 3917)       52.8        13          11.90
   4      2062 (12618)       49.1        11          11.07
   5      2980 ( 4996)       71.0        14          16.00
   6      2438 ( 4201)       58.0         7          13.09
   7       118                2.8         0           0.63
   8       142                3.4         1           0.76
   9       285 (  410)        6.8         3           1.53
  10       593 ( 1093)       14.1         5           3.18
  11        27                0.6         0           0.14
  12        34                0.8         0           0.18
  13        24 (  524)        0.6         0           0.13
  14        28 (   94)        0.7         0           0.15
  15        24                0.6         0           0.13
  16        16                0.4         0           0.09
  17        14                0.3         0           0.08
  18        55 (   73)        1.4         0           0.30
  19         5                0.1         0           0.03
  20       103 (  299)        2.5         1           0.55
  21        31                0.7         0           0.17
  22         2 (   70)        0           0           0.01
  23        24                0.6         0           0.13
  24         4                0.1         0           0.02
  25         7                0.2         0           0.04
  26         2                0           0           0.01
  27         4                0.1         0           0.02
  28         3 (   41)        0           0           0.02
  29         8                0.2         0           0.04
  30         8 (   28)        0.2         0           0.04
  31         2                0           0           0.01
  32         0                0           0           -
  33         5 (   13)        0.1         0           0.03
  34         1                0           0           0.01
  35         1                0           0           0.01
  36         0                0           0           -
  37         2                0           0           0.01
  38         0                0           0           -
  39         2                0           0           0.01
  40         6                0.1         0           0.03
  41         2                0           0           0.01
  42         1                0           0           0.01
  43         1                0           0           0.01
  44         2                0           0           0.01
  45         0                0           0           -
  46         1                0           0           0.01
  47         1                0           0           0.01
  48         4                0.1         0           0.02
  49         4                0.1         0           0.02
  50         4                0.1         0           0.02

Counts for Sizes > 50

 Size   Count      Size   Count      Size   Count

   51      1         52      1         54      2
   55      2         57      3         58      3
   60      8         62      1         63      1
   66     27         70      4         72      2
   74      2         76      2         80     14
   81      8         83      1         84      1
   85      2         89      2         90      1
   92      1         93      2        100     12
  104      1        107      1        109      2
  110      1        116      1        120      2
  126      1        132      2        133     24
  136      1        150      4        162      1
  181      1        191      1        203      1
  205      1        264      1        287      1
  310      1        312      1        379      1
  392      1        398      1        405      1
  449      1        700      1       1500      1

Figure 6.1
Frequency Count Plot

6.3  Statement Type Counts

Another type of analysis we did was to count the frequency of
occurrence of each type of statement. This analysis was done for 34 of
our programs, since eight of the 42 programs were too large for us to
transform into an analyzable form. Table 6.2 gives a summary of the
results of this analysis. Figure 6.2 is a plot of the data in Table 6.2
broken down into classes of statements. Figure 6.3 is a plot of the data
in Table 6.2 broken down into individual statement types.

As  a  part  of  the  same  analysis,  we  counted  the  frequency  of 
occurrence  of  each  of  the  operators  available  in  COBOL.  A  summary  of 
this  data  is  shown  in  Table  6.3.  Since  incrementing  is  very  common,  it
was  counted  separately  from  other  types  of  ADD  instructions.  Figure  6.4 
presents  the  data  of  Table  6.3  broken  down  by  operator  class. 
Figure  6.5  shows  the  data  of  Table  6.3  broken  down  by  individual  opera- 
tor type. 

6.4  Program  Analyses 

A  number  of  the  programs  were  examined  to  determine  what  sort 
of  speedup  our  method  might  actually  deliver.  In  the  process,  informa- 
tion relating  to  machine  parameters  was  also  generated. 

Table  6.4  gives  a  brief  summary  of  the  statistics  relating  to 
the  sizes  of  the  programs  that  were  analyzed.  It  should  be  noted  that 
these  are  not  large  programs  since  the  analyses  were  done  by  hand.  They 
seem  to  be  typical  of  the  small-  to  medium-sized  programs  found  at  any 
COBOL  computer  center.  The  following  is  a  list  of  the  programs  we 
analyzed  with  a  brief  description  of  each  one: 

B7510363  A report and an error listing are generated
          from one input file. This program required
          heavy interlocking.

15156040  From one input file this program prints Avery
          labels, sorts the file, then prints a report.

15156050  Records from the input file are read, and
          selected items are copied to an intermediate
          file which is then sorted. A report is generated
          from the sorted file.

15210030  From an old master and finder cards, this
          program copies the old master into a new
          master, updating selected records.

15212005  This includes two programs in one. The first
          one reads one file, edits the input, then
          outputs a modified file. The second one
          uses a finder file to select master file
          records, then outputs records with data from
          both finder and master records.

SSN512    Data from one input file with additional data
          from a master file is copied to the output
          file.

S7510025  This program generates an output file of 1350
          identical records. It does no input.

S7550180  This program generates an output file from an
          input file plus information from matching
          master records.

S7550181  This program is similar to S7550180, but it
          does more calculation of output data.

S7550182  This program is similar to S7550180, but it
          copies more data into an output file.

S7550183  This program is similar to S7550182.

Table 6.2
Frequency Count of Statement Types

Statement          Number of     Percent of all   Mean per   Median per
Type              Occurrences     Occurrences     Program     Program

All Statements       16945          100            498.4        143

I/O                   1583            9.3           46.5         26
  READ                 134            0.8            3.9          2
  WRITE                900            5.3           26.5          7
  REWRITE                3            0              0.1          0
  ACCEPT                18            0.1            0.5          0
  DISPLAY              242            1.4            7.1          5
  EXHIBIT               56            0.3            1.6          0
  RETURN                22            0.1            0.6          0
  RELEASE               67            0.4            2.0          0
  OPEN                  69            0.4            2.0          2
  CLOSE                 72            0.4            2.1          2

Assignment           10735           63.4          315.7         84
  MOVE                8852           52.2          260.3         62
  TRANSFORM            101            0.6            3.0          0
  Arithmetic          1782           10.5           52.4         16

Control               4037           23.8          118.7         27
  IF                  3998           23.6          117.6         27
  PERFORM               39            0.2            1.1          0

Misc.                  590            3.5           17.4          4
  EXAMINE                4            0              0.1          0
  SORT                  65            0.4            1.9          0
  ON, AT               492            2.9           14.5          4
  CALL                  29            0.2            0.9          0

Figure 6.2
Histogram: Statement Types by Class

Table 6.3
Frequency Count of Operator Types

Operator           Number of     Percent of all   Mean per   Median per
Type              Occurrences     Occurrences     Program     Program

Operators             7391          100            217.4         59

Arithmetic            2122           28.7           62.4         16
  Increment            390            5.3           11.5          9
  +                    934           12.6           27.8          4
  -                     50            0.7            1.5          0
  *                    375            5.1           11.0          0
  /                    341            4.6           10.0          0
  **                    15            0.2            0.4          0

Comparison            4655           62.9          136.9         29
  =                   3216           43.5           94.6         21
  <                    170            2.3            4.9          2
  >                    496            6.7           14.6          3
  ≠                    641            8.7           18.9          5
  ≮                     24            0.3            0.7          0
  ≯                     68            0.9            2.0          0

Connective             584            7.9           17.2          4
  OR                   216            2.9            6.4          2
  AND                  368            5.0           10.8          0

Misc.                   30            0.4            0.9          0
  NUMERIC               30            0.4            0.9          0
  ALPHABETIC             0            0              0            0

Table 6.4
Statistics for Analyzed Programs

              Number of    Number of    Number of     Number of    Number of
Program ID    Data Cards   Variables    Proc. Cards   Statements   Phases

B7510363         149           60          156           140          1
15156040         378          202          212           172          2
15156050          46           94           70            49          2
15210030         200           75           88            65          1
15212005         158           66          229           183          3
SSN512            69          160           81            84          2
S7510025          68            4            7             5          1
S7550180         217           85           84            73          2
S7550181         215           33           73            59          2
S7550182          68           40          106           101          2
S7550183         230           90          121           169          2

As  a  part  of  the  analysis,  storage  allocation  was  done  for 
each  phase.  In  view  of  the  peaks  in  Figure  6.1  at  variable  sizes  of  5, 
10,  and  20  characters,  an  allocation  was  made  for  each  of  these  sizes. 
Table  6.5  presents  the  number  of  words  needed  and  the  percentage  utili- 
zation for  each  of  these  allocations.  The  significance  of  these 
results  is  discussed  in  Chapter  7. 
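
One plausible way to compute such figures is sketched below in Python. It
assumes, as our own reading rather than a statement of the method used for
Table 6.5, that every item starts on a word boundary, that an item of s
characters occupies ceil(s / w) words of w characters, and that utilization
is the fraction of allocated character positions actually holding data. The
list of sizes is made up.

```python
import math


def allocation(sizes, word_size):
    """Words needed and percent utilization for items on word boundaries."""
    words = sum(math.ceil(s / word_size) for s in sizes)
    used = sum(sizes)
    return words, 100.0 * used / (words * word_size)


sizes = [1, 2, 5, 5, 6, 10, 20, 43]   # made-up variable sizes (characters)
for w in (5, 10, 20):
    words, utilization = allocation(sizes, w)
    print(f"word size {w:2d}: {words:3d} words, {utilization:5.1f}% utilized")
```

The tradeoff is visible even in this small example: a longer word reduces
the number of words per item but leaves more unused positions in the many
short items.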

6.5  Program  Simulation 

After  the  detailed  analysis  of  a  program  was  complete,  its 
execution  was  simulated  to  gather  further  information  about  machine 
parameters  and  to  obtain  an  estimate  of  the  speedup  possible  using  our 
method.  For  this  simulation  the  following  rules  were  used: 

1)  The  effects  of  storage  access  conflicts  were 
ignored.  This  is  justified  since  the  multiplicity 
of  Data  Memory  Units  and  the  use  of  the  Storage 
Allocation  Algorithm  of  section  3.5  should  keep 
the  number  of  such  conflicts  small. 

2)  IF  Trees  of  four  or  fewer  levels  were  given  a  con- 
current execution  time  of  one  unit  time,  while 
larger  IF  Trees  were  given  an  execution  time  of 
two  units  of  time.  All  of  the  IF  Trees  we  found 
could  be  arranged  in  eight  or  fewer  levels.  These 
execution  times  are  in  line  with  Davis's  results 
[DAV72a].

3)  Reading records from the primary input file was
    assumed to take no time. We assume that data from
    the primary input file is transferred from the
    I/O Processor to the Data Memory before the READ
    statement is executed. For statements accessing
    other files, we assume that it takes one cycle to
    move the data between the Data Memory and the
    I/O Processor.

4)  Fetching  and  storing  data  were  assumed  to  take  no 
execution  time.  In  the  case  of  store  instructions, 
data  to  be  stored  is  kept  in  one  of  the  Data 
Registers  in  a  Data  Memory  Unit  until  the  Primary 
Memory  has  a  free  cycle.  Allotting  no  time  to  a 
fetch  of  data  by  a  processor  is  justified  on  the 
following  grounds.  We  can  consider  the  Instruction 
Dispatch  Unit  to  be  a  "black  box"  which  releases 
    each instruction some time after it arrives from
    an Address Counter. While the delay is not the
    same for each instruction, we can say that there is
    some, unspecified, average delay for each program.
    We would expect this average delay to be much
    smaller than the total execution time of the
    program, hence we ignore this delay in our calculation
    of program speedup. Since the Instruction Dispatch
    Unit does not release an instruction until its data
    is available in a Data Register, and since we expect
    transfers of data from a Data Register to a
    processor to take much less than the time needed
    to execute the operation, it seems reasonable to
    ignore data fetch time.

5)  An  exception  to  Rule  (4)  was  the  MOVE  operation. 
Since  a  MOVE  is  nothing  more  than  a  fetch  followed 
by  a  store,  it  is  tempting  to  say  that  MOVEs  take 
no  time.  In  view  of  the  large  fraction  of  a  COBOL 
program  that  MOVEs  comprise,  however,  this  would 
not  be  realistic.  Thus,  we  used  a  time  of  one  unit 
for  a  MOVE  instruction.  Further,  we  assumed  that  a 
MOVE  CORRESPONDING  instruction  required  one  unit  of 
time  for  each  item  to  be  moved. 

6)  All  arithmetic,  comparison,  and  connective  operations 
were  assumed  to  take  a  single  unit  of  execution  time. 

7)  FORK  and  RELEASE  instructions  were  assumed  to  take 
no  execution  time.  This  is  justified  since  these 
instructions  go  directly  to  the  Address  Counter 
Coordinator  which  executes  them  while  the  Address 
Counter  continues  on  with  the  next  instruction. 

8)  QUIT instructions were assumed to take one unit of
    time to execute. This was done to simulate the
    delay between the start of execution of a QUIT
    and the time the Address Counter is again
    available for assignment.

9)  The operation of TEST instructions was simulated
    by delaying the execution of an interlocked block
    until the preceding Address Counter had passed an
    appropriate RELEASE instruction.

10)  TRANSFORM  instructions  were  assumed  to  take  one 
unit  of  time  to  execute. 

11)  Those  instructions  containing  subscripted  vari- 
ables were  assumed  to  require  a  unit  of  time  for 
effective  address  calculation  in  addition  to  the 
execution  time  of  the  operation.  This  was  done 
to  allow  for  the  time  needed  to  transfer  the 
subscript  value  from  the  Data  Memory  to  an  Address 
Counter.  For  a  block  of  assignment  statements 
using  the  same  subscript  more  than  once,  the 
additional  unit  of  time  was  added  to  the  whole 
block  once,  rather  than  to  each  subscripted  state- 
ment in  the  block. 

12)  ON  and  AT  conditions  were  assumed  to  require  no 
time  to  execute  since  they  could  be  implemented 
as  interrupts  causing  a  branch  to  the  appropriate 
section  of  code  during  the  execution  of  the 
instruction  to  which  the  ON  or  AT  condition  was 
attached. 

13)  Each  exit  from  a  conditional  branch  was  assigned 
a  probability  of  execution  according  to  the 
following  criteria:

a)  Exits leading to error processing were assigned
    a probability of zero. While these paths in a
    program are undoubtedly executed in the real
    world, they should normally be executed in very
    small proportion to the non-error processing.

b)  Any  exit  leading  to  an  early  termination  of  the 
program  was  assigned  an  execution  probability 
of  zero. 

c)  For  some  programs,  information  was  available  to 
us  in  the  form  of  counter  values  accumulated 
during  the  actual  execution  of  the  program. 
From  these  counter  values,  rough  estimates  of 
execution  probabilities  for  some  paths  could
be  made.

d)  In  all  other  cases,  it  was  assumed  that  exits 
from  a  conditional  branch  instruction  were 
equally  probable. 
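
A few of these rules can be written down compactly. The Python sketch
below is an illustration under our own assumptions: the statement labels
and function names are invented, only the unit times stated in the rules
above are tabulated, and measured probabilities (criterion c) are not
modeled. It computes the expected time of a conditional branch whose
normal exits are taken with equal probability.

```python
# Unit times from the rules above; the statement labels are our own.
UNIT_TIME = {
    "MOVE": 1, "ARITH": 1, "COMPARE": 1, "CONNECTIVE": 1,
    "TRANSFORM": 1, "QUIT": 1,
    "FORK": 0, "RELEASE": 0, "ON": 0, "AT": 0,
}


def block_time(statements, subscripted=False):
    """Time for a block; one extra unit if the block uses a subscript."""
    extra = 1 if subscripted else 0
    return sum(UNIT_TIME[s] for s in statements) + extra


def exit_probabilities(exits):
    """Error and early-termination exits get 0; normal exits share 1."""
    normal = [name for name, kind in exits if kind == "normal"]
    return {name: (1 / len(normal) if kind == "normal" else 0.0)
            for name, kind in exits}


def expected_time(exits, times):
    """Probability-weighted time over all exits of one conditional branch."""
    probabilities = exit_probabilities(exits)
    return sum(probabilities[name] * times[name] for name in probabilities)


exits = [("update", "normal"), ("copy", "normal"), ("error", "error")]
times = {
    "update": block_time(["MOVE", "ARITH", "MOVE"]),
    "copy": block_time(["MOVE"], subscripted=True),
    "error": 10,
}
print(expected_time(exits, times))
```

Because the error exit carries probability zero, its (arbitrary) time of
10 units does not contribute to the estimate.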

The  results  of  the  simulations  are  given  in  Table  6.6.  The 
following  items  are  tabulated  for  each  phase: 

# STMT    - Number of statements found within the phase. Note that
            this is a static, rather than dynamic, value.
$t_1$     - The amount of time a sequential processor needs to
            process one record.
$T_1$     - The amount of time a sequential processor needs to
            execute the segment of the simulated code on which the
            speedup calculation is based.
$T_p$     - The amount of time our concurrent processor needs to
            execute the same segment of code which generated the
            value given for $T_1$.
$p_{max}$ - The maximum number of arithmetic units required to
            execute the simulated code.
$\bar{a}$ - The average number of Address Counters in use during
            the execution of the program.
$n$       - The maximum number of Address Counters needed during
            the execution of the program.
$S_p$     - The average speedup found from the ratio:

            $$S_p = \frac{T_1}{T_p}$$

Table 6.6
Speedup Results

Program ID   Phase   # STMT   $t_1$    $T_1$      $T_p$   $S_p$

B7510363       1        74    34       34         12      2.8
15156040       1       105    136      136p       16      8.5p
               2        22    15       15p        6       2.5p
               All     127    103      151p       22      6.9p
15156050       1        14    11       11p        5       2.2p
               2        16    13       24         10      2.4
               All      30    12.3     11p+24     15      0.7p+1.6
15210030       1        50    42       42         -       4.7
15212005       1       100    49.5     257.4      13.5    19.1
               2        36    14.1     14.1       7.0     2.0
               3         9    9        24         6.0     4.0
               All     145    36.8     295.8      26.5    11.2
SSN512         1        54    37.5     75         14      5.4
               2        18    17       17p        4       4.3p
               All      72    32.9     17p+75     18      0.9p+4.2
S7510025       1         4    4        4          -       1
S7550180       1        64    34       102        36      2.8
               2         5    6        6p         6       8p
               All      69    30.0     6p+102     42      1.1p+2.4
S7550181       1        40    25       50         12      4.2
               2         8    10       10p        3       3.3p
               All      48    22.0     10p+50     15      0.7p+3.3
S7550182       1        63    65.5     131        9       14.6
               2        29    60       120        5       24.0
               All      92    63.5     251        14      17.9
S7550183       1        93    68.5     137        17      8.1
               2        47    45       45p        3       15p
               All     140    65.0     45p+137    20      2.3p+6.9

A dash marks an entry that is missing from this copy. The entries of
the remaining columns ($p_{max}$, $\bar{a}$, and $n$) cannot be
realigned with their rows in this copy. In printed order they are:

   12p, 4, 12p, 4p, 2, 4p, 25, 5, 3, 25, 3, 0, 3, 1, 4, 2p, 2p, p, p,
   3, 3, 3, 7, 2p, 2p

   1.3, p, p, p, p, 1.3, 3, .7p+8.7, 1.9, 5.2, 1.1, 2.3, 3.5, 1.9, 0,
   p, .2p+1.5, 1, 1.4, 0, p, .1p+1.2, 1.6, 0, p, .2p+1.3, 1.5, 2.0,
   1.7, 2.3, 0, p, .2p+2.0

In addition, these items are also tabulated for the program as a whole.
In generating this information, however, the effect of the execution
time of links was ignored. Compared to the execution time of a phase,
we expect link execution time to be negligible. The total values,
then, are for all of the phases combined, as follows:

$$T_1 = \sum_{i=1}^{N} T_{1i}$$

$$T_p = \sum_{i=1}^{N} T_{pi}$$

$$p_{max} = \max \{ p_{max,i} \} \qquad (i = 1, 2, \ldots, N)$$

$$\bar{a} = \frac{1}{T_p} \sum_{i=1}^{N} \bar{a}_i T_{pi}$$

$$n = \max \{ n_i \} \qquad (i = 1, 2, \ldots, N)$$

$$S_p = \frac{T_1}{T_p}$$

where N = number of phases in the program. We have assumed that the
same number of records is processed by each phase. The factor p
appearing in some of the entries is defined as the minimum of the
number of records available to be processed and the number of Address
Counters available to direct the processing. For example, if there
are 10,000 records available and 10 Address Counters, then p = 10 and
a speedup of 5p is a speedup of 50.
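
As a small illustration of how the symbolic entries may be read, the following Python sketch evaluates a tabulated speedup of the form a·p + b for a given number of records and Address Counters. The helper names `p_factor` and `speedup` are hypothetical and are introduced here only for illustration.

```python
def p_factor(records_available: int, address_counters: int) -> int:
    """Factor p: the smaller of the records available and the Address Counters."""
    return min(records_available, address_counters)


def speedup(coeff: float, const: float, p: int) -> float:
    """Evaluate a tabulated speedup entry of the form coeff*p + const."""
    return coeff * p + const


# The example from the text: 10,000 records and 10 Address Counters.
p = p_factor(10_000, 10)
example_5p = speedup(5.0, 0.0, p)

# An entry printed in mixed form, such as 0.7p+1.6, is evaluated the same way.
entry_mixed = speedup(0.7, 1.6, p)
```

With few records the factor p is limited by the records rather than by the hardware, which is why the entries are left in symbolic form.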
7. MACHINE PARAMETERS

To  complete  the  design  of  our  concurrent  record  processing 
machine,  we  present  in  this  chapter  the  values  of  various  parameters  for 
the  machine  design  proposed  in  Chapter  5. 

7.1  Speed  Limitation 

By  selecting  the  appropriate  numbers  of  memories  and  proces- 
sors, we  can  supply  any  memory  and  computational  bandwidth  needed.  To 
provide  the  necessary  instruction  fetch  rate  from  the  Address  Counters, 
we  can  use  replicated  interleaved  Program  Memories,  multiple-instruction 
fetches,  and  a  pipelined  Address  Counter  design.  Compared  to  the 
instruction  rates  handled  by  the  Data  Memory  and  the  processors,  the 
rate  handled  by  the  Address  Counter  Coordinator  should  be  quite  small. 
Thus,  none  of  these  units  should  limit  the  speed  of  our  machine. 

The speed of the Instruction Dispatch Unit, however, does impose a
limitation on the speed of our design. All instructions except branch
instructions have to pass through it. Pipelining steps (1) through (7)
of the operation of this unit (section 5.4) and executing steps (3)
through (6) in parallel by maintaining three copies of the operand
address in the Tag Status Register Array should allow us [GUN70,
VAD71] to process a single instruction in about 250 nanoseconds. With
a five stage pipeline, this gives us a dispatch time of 50 nanoseconds
per instruction. Taking one microsecond as the Primary Memory cycle
time, this gives us a throughput rate of about 20 instructions per
cycle.

7.2 Number of Address Counters

From  the  estimated  throughput  rate  for  the  Instruction 
Dispatch  Unit,  it  is  apparent  that  20  Address  Counters,  each  fetching 
an  instruction  per  cycle,  would  generate  the  maximum  load  the 
Instruction  Dispatch  Unit  could  handle.  In  order  to  have  a  power  of  two 
to  simplify  addressing,  we  choose  16  as  the  number  of  Address  Counters 
in  our  machine. 
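
The arithmetic behind sections 7.1 and 7.2 can be retraced with a short Python sketch. The constant and function names are assumptions made for this illustration; only the numerical values come from the text.

```python
INSTRUCTION_LATENCY_NS = 250   # time for one instruction to pass through the unit
PIPELINE_STAGES = 5
MEMORY_CYCLE_NS = 1000         # Primary Memory cycle time (one microsecond)

# One instruction leaves the pipeline every latency / stages nanoseconds.
dispatch_time_ns = INSTRUCTION_LATENCY_NS / PIPELINE_STAGES
instructions_per_cycle = MEMORY_CYCLE_NS / dispatch_time_ns


def largest_power_of_two_at_most(n: int) -> int:
    """Largest power of two that does not exceed n (n >= 1)."""
    return 1 << (n.bit_length() - 1)


# Each Address Counter is taken to fetch one instruction per cycle.
address_counters = largest_power_of_two_at_most(int(instructions_per_cycle))
```

Rounding down rather than up keeps the load offered by the Address Counters below the estimated capacity of the Instruction Dispatch Unit.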

7.3  Data  Memory  Word  Size 

As  a  result  of  our  analysis  of  variable  size  (Table  6.1)  and 
of  storage  allocation  (Table  6.5),  we  can  select  a  word  length  for  the 
Data  Memory.  Considering  Figure  6.1  we  note  that  there  are  peaks  at 
values  of  5,  10,  and  20  characters.  Clearly,  we  would  prefer  to  use 
these  sizes  rather  than,  say,  4,  8,  or  16.  We  note,  from  data  presented 
in  Table  6.5,  that  the  efficiency  of  memory  usage  is  lowest  for  a  word 
size  of  20  characters,  much  better  for  a  word  size  of  10  characters,  but 
only  marginally  improved,  beyond  that,  for  a  word  length  of  5  characters. 
The  choice  of  word  size  is  made  more  obvious  if  we  plot  the  percentage 
of  the  variables  in  our  sample  having  a  length  less  than  or  equal  to  a 
particular  size,  against  that  variable  size.  This  is  shown  in 
Figure  7.1.  As  can  be  seen  from  this  plot,  almost  80%  of  the  variables 
have  a  length  less  than  or  equal  to  5,  and  over  96%  have  a  length  less 
than or equal to 10. On the basis of these statistics, it seems
apparent that 10 characters per word is a good Data Memory word length.

Figure 7.1
Cumulative Variable Size Distribution

7.4 Data Character Size

Since  the  number  of  bits  in  a  character  did  not  enter  into  any 
of  our  analyses,  we  lack  a  statistical  basis  from  which  we  can  derive  a 
good  character  size.  Instead,  we  present  the  following  remarks  as  jus- 
tification for  our  choice. 

Judging  from  the  programs  we  have  examined  and  from  discus- 
sions with  COBOL  programmers,  it  appears  that  data  items  are  most 
commonly  viewed  as  character  strings.  This  includes  alphabetic  informa- 
tion such  as  a  student's  name,  non-computational  numeric  data  such  as 
his  Social  Security  Number,  and  computational  data  such  as  the  number  of 
hours  of  credit  he  has  accumulated.  Furthermore,  even  high  usage 
numeric  data  such  as  record  counters  are  often  retained  in  character 
form.  During  the  execution  of  an  IBM/360  COBOL  program,  this  results  in 
much  transformation  of  data  from  character  form  to  packed  decimal  form 
and  back  again.  It  seems  apparent  that  the  use  of  a  single  representa- 
tion for  all  data  could  result  in  a  modest  improvement  in  execution 
speed  by  simply  avoiding  all  of  these  unnecessary  transformations. 
Doing  arithmetic  with  such  a  representation  should  prove  to  be  no  prob- 
lem [SCH72]. 

Another  consideration  in  determining  character  size  is  the 
number  of  characters  needed  in  the  character  set.  The  COBOL  programs  we 
examined  had  no  need  for  a  huge  set  of  characters.  Some  machines  [BUR67] 
executing  COBOL  programs  today  have  as  few  as  64  characters  in  their 
character  sets  without  ill  effect. 

Thus, we propose that our machine use 6-bit characters. Further,
we propose that this be the only way data is encoded. Our
arithmetic units use the low order four bits of a character as a BCD
encoding of a number, signalling an error condition if the high order
bits are not the proper code for a numeric character.
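
A minimal Python sketch of this decoding rule follows. The text does not give the bit pattern that marks a numeric character, so the value of `NUMERIC_ZONE` is purely an assumption, as are the function names and the additional rejection of low-order values above 9.

```python
NUMERIC_ZONE = 0b00  # hypothetical code for "numeric character"; not given in the text


def decode_digit(char6: int, numeric_zone: int = NUMERIC_ZONE) -> int:
    """Return the decimal digit held in a 6-bit character.

    The low-order four bits are read as a BCD digit. The two high-order
    bits must equal the numeric zone code, otherwise an error is raised.
    """
    if not 0 <= char6 < 64:
        raise ValueError("not a 6-bit character")
    zone, digit = char6 >> 4, char6 & 0b1111
    if zone != numeric_zone:
        raise ValueError("high-order bits are not the numeric code")
    if digit > 9:  # extra check assumed here: 10 through 15 are not BCD digits
        raise ValueError("low-order bits are not a BCD digit")
    return digit


def decode_number(chars: list[int]) -> int:
    """Interpret a string of numeric characters, most significant first."""
    value = 0
    for c in chars:
        value = value * 10 + decode_digit(c)
    return value
```

Because the digit is taken directly from the character, no conversion to a packed form is needed before arithmetic begins.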

7.5  Number  of  Data  Memory  Units 

There  is  no  apparent  correlation  between  program  size  and  the 
number  of  memory  units  needed  for  the  program's  data.  This  is  shown  in 
Table  7.1  and  in  Figure  7.2.  It  is  apparent  that  16  memory  units  would 
be  adequate  for  all  but  a  few  programs.  The  data  for  these  few  programs 
could  be  made  to  fit  into  16  memories  at  some  sacrifice  in  execution 
speed  due  to  memory  access  conflicts.  However,  this  degradation  would 
be  mitigated  through  our  use  of  high  speed  Data  Registers  in  the  Data 
Memory.  Thus,  we  propose  that  16  Data  Memory  Units  be  used. 

7.6  Size  of  Data  Memory 

Table 7.1
Memory Requirements

                                # Wds    # Wds      #        Local Wds
Program ID   # Cards   Phase   Global   Local   Memories   x Memories

B7510363       305       1        3        5       13          65
15156040       590       1                 3       12          36
                         2                 2       12          24
15156050       116       1                 2        4           8
                         2                17        4          68
15210030       288       1                 3       50         150
15212005       387       1        3        1        9           9
                         2        3        3        8          24
                         3                 1        8           8
SSN512         150       1                 2       41          82
                         2                 2       40          80
S7510025        75       1                 2       12          24
S7550180       301       1                 3       54         162
                         2                 2       54         108
S7510181       288       1                10       11         110
                         2                 9       11          99
S7550182       174       1                 2       16          32
                         2                 2       12          24
S7550183       351       1                19       17         323
                         2                16       17         272

From Table 7.1 we find that the number of words needed to hold
the globally accessible data is small compared to that needed for
locally accessible data. The largest number of words needed for local
storage by our sample programs is 19 words per memory unit. Allowing
for the size of our programs and the growth of memory requirements
with program size, 128 words per memory unit per Address Counter seems
a reasonable amount. Programs needing more than this will not be able
to use all 16 Address Counters, but will be able to execute. This
gives a requirement of about 2K words (1K = 1024; 1M = 1,048,576) of
local memory in each memory unit. In terms of characters and bits,
this would provide 20K characters and 120K bits of storage. For a 16
memory unit machine, then, we have 32K words, 320K characters, or
1.9M bits. In terms of 8 bit bytes, this would be slightly more than
243K bytes of data storage capacity.
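
The storage totals of this section can be rechecked with the Python sketch below, in which the constant names are assumptions introduced for illustration. Computed without intermediate rounding, the total comes to 1.875M bits, or 240K bytes; the 243K figure quoted above is what results from starting with the rounded value of 1.9M bits.

```python
K = 1024

WORDS_PER_UNIT_PER_COUNTER = 128
ADDRESS_COUNTERS = 16
MEMORY_UNITS = 16
CHARS_PER_WORD = 10
BITS_PER_CHAR = 6

# Local memory in a single Data Memory Unit.
words_per_unit = WORDS_PER_UNIT_PER_COUNTER * ADDRESS_COUNTERS
chars_per_unit = words_per_unit * CHARS_PER_WORD
bits_per_unit = chars_per_unit * BITS_PER_CHAR

# Totals for the whole machine.
total_words = words_per_unit * MEMORY_UNITS
total_chars = total_words * CHARS_PER_WORD
total_bits = total_chars * BITS_PER_CHAR
total_bytes = total_bits // 8
```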

From  Table  7.1  it  appears  that  4  Data  Registers  per  Data 
Memory  Unit  per  Address  Counter  should  be  adequate.  Thus,  we  need  64 
Data  Registers  per  Data  Memory  Unit  for  a  total  of  1024  words  of  high 
speed  memory.  In  each  Data  Memory  Unit  we  also  need  64  words  x  11  bits 
of  content  addressable  Address  Memory. 

7.7  Size  of  Program  Memory 

In  order  to  calculate  the  size  of  a  Program  Memory  word,  we 
need  to  make  the  following  assumptions: 

1)  Each  instruction  can  contain  two  operand  addresses 
and  a  result  address. 

2)  Each  address  in  the  instruction  is  composed  of  a 
base  register  field,  an  index  register  field,  and 
a  displacement  field. 

3)  Each  Address  Counter  contains  8  registers  as  base 
or  index  registers,  requiring  6  bits  per  address. 

4)  The  maximum  displacement  necessary  is  250  charac- 
ters (25  words),  requiring  8  bits  per  address. 

5)  Eight  bits  are  sufficient  for  the  operation  code. 

6)  Six  bits  are  sufficient  to  contain  the  length  of 
each  operand  and  result. 

Each operand or result, then, requires 20 bits to hold the address
(6 register bits and 8 displacement bits) and the length (6 bits).
With the 8 bit operation code, this gives us a total requirement of
68 bits per instruction.

Since not all types of instructions have two operands and a
result, short-format instructions can be defined which use up to 34
bits and are stored two per word.
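
The word length follows directly from assumptions (1) through (6), as the Python sketch below shows. The variable names are hypothetical; the field widths are derived from the stated register count and maximum displacement.

```python
REGISTERS_PER_COUNTER = 8
MAX_DISPLACEMENT = 250          # characters
OPCODE_BITS = 8
LENGTH_BITS = 6
ADDRESSES_PER_INSTRUCTION = 3   # two operands and a result

# Bits needed to select one of the registers (values 0 .. 7).
register_field_bits = (REGISTERS_PER_COUNTER - 1).bit_length()
base_and_index_bits = 2 * register_field_bits

# Bits needed to hold displacements 0 .. 250.
displacement_bits = MAX_DISPLACEMENT.bit_length()

bits_per_operand = base_and_index_bits + displacement_bits + LENGTH_BITS
instruction_bits = ADDRESSES_PER_INSTRUCTION * bits_per_operand + OPCODE_BITS
short_format_bits = instruction_bits // 2
```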

To  obtain  an  estimate  of  the  total  number  of  words  needed  in 
the  Program  Memory,  we  need  to  estimate  the  size  of  the  largest  programs 
we  can  reasonably  expect  a  user  to  attempt  to  run.  Table  7.2  gives  some 
information  which  was  obtained  from  the  Student  Data  Area  and  the 
Financial  Area  of  the  University  of  Illinois  Office  of  Administrative 
Data Processing. This data shows that there are few very large programs,
with none of these large programs containing more than 6,000 cards. In
addition, we were informed that large programs tended to be overlaid to
reduce  the  memory  required  for  the  program.  Thus,  it  seems  reasonable 
to  assume  that  6,000  cards  produce  about  as  much  code  as  we  would  expect 
to  be  required  to  hold  in  the  Program  Memory  at  one  time. 

To  obtain  an  estimate  of  the  ratio  of  the  number  of  cards  per 
statement  in  a  large  program,  we  examined  our  original  sample  of  42  pro- 
grams and  computed  this  ratio  for  each  program  of  more  than  1,000  cards. 
The  average  value  of  this  ratio  was  approximately  1.5.  If  we  assume 
that  a  6,000  card  program  will  generate  a  statement  for  each  1.5  cards, 
this  implies  that  we  need  about  4,000  words  of  Program  Memory.  To  imple- 
ment a  memory  of  4,096  words  of  68  bits  per  word  requires  272K  bits. 
This  is  equivalent  to  a  34K  byte  memory. 
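
The size estimate can be reproduced as follows. This Python sketch assumes, as the text appears to, that each statement occupies about one Program Memory word, and it rounds the word count up to a power of two; both points are labeled in the code, and the names are hypothetical.

```python
K = 1024

LARGEST_PROGRAM_CARDS = 6000
CARDS_PER_STATEMENT = 1.5
WORD_BITS = 68

statements = LARGEST_PROGRAM_CARDS / CARDS_PER_STATEMENT

# Assumption for this sketch: one Program Memory word per statement,
# with the memory size rounded up to the next power of two.
memory_words = 1 << (int(statements) - 1).bit_length()

memory_bits = memory_words * WORD_BITS
memory_bytes = memory_bits // 8
```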

Another part of the Program Memory for which we need to compute
a size is the Cache. For most programs, we would like the Cache
to hold an entire phase. From Table 6.4 we find none of the programs
we analyzed had phases which exceeded 200 statements in total size.
However, allowing for the fact that our sample programs are small, we
propose a Cache size of 1K words.

Table 7.2
Program Size Statistics

Student Data Area programs for which data was readily available:

Program Size     Number    Percent of Sample

< 500 cards        292            56
< 1000             147            28
< 2000              60            11
< 3000              17             3
< 4000               5             1
< 5000               3           < 1
> 5000               0             0

Financial Area largest programs for which data was at hand:

3348 cards
3293
5627

7.8  Numbers  of  Processors 

7.8.1  IF  Tree  Processor 

Table  6.2  shows  that  IF  statements  form  about  a  quarter  of  the 
statements  in  our  sample.  If  we  assume  that  we  can  group  an  average  of 
five  IF  statements  per  IF  Tree,  which  is  a  fairly  conservative  assump- 
tion, then  roughly  5%  of  the  operations  executed  in  our  machine  will  be 
IF  Tree  execution  operations.  With  16  instruction  streams  active,  this 
implies  that  we  need  one  IF  Tree  Processor. 

7.8.2  Arithmetic/Logical  Processors 

For the Arithmetic/Logical Processor we look first at
Table 6.2 and find that the Arithmetic and IF statements, in which the
operators of Table 6.3 are found, comprise 34.1% of the program.
Comparing the number of operators with the number of Arithmetic and IF
statements, we find an average of 1.2 operators per statement.* This
implies that almost 41% of the instructions in the program require an
Arithmetic/Logical Processor. Doing a similar calculation for
memory-only operations (MOVE, Increment, etc.), we arrive at an
estimate that about half the operations involve only the memory.

* For this calculation we have not included the Increment operator and
the NUMERIC and ALPHABETIC tests, which are executed in the Data
Memory.

Next, we consider the interaction between the memories and the
processors. Using access times and cycle times of an existing machine
as a guide [IBM73a, IBM73b], it seems reasonable to assume the following
operation times:

   Fetch from Primary Memory to a Data Register      1000 nsec.
   Transfer between Data Register and processor       200 nsec.
   Add instruction time                                600 nsec.

With operation times on this order, it is apparent that a pair of memory
units could keep a pair of processors supplied with data and still have
time for two to three memory-only operations per Primary Memory cycle.
This is roughly the 40%:50% proportion of Arithmetic/Logical to
memory-only operations we found in the first part of this section.
Thus, there should be approximately the same number of
Arithmetic/Logical Processors as Data Memory Units. To make the total
number of Arithmetic/Logical and IF Tree Processors a power of two, we
choose 15 as the number of Arithmetic/Logical Processors.
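
The two numerical steps of this section can be written out as a brief Python sketch. The variable names are assumptions; the step from "approximately the same number as Data Memory Units" to a total of 16 processors is our reading of the text.

```python
ARITH_AND_IF_FRACTION = 0.341    # share of statements, as quoted from Table 6.2
OPERATORS_PER_STATEMENT = 1.2

# Fraction of instructions that need an Arithmetic/Logical Processor.
alu_fraction = ARITH_AND_IF_FRACTION * OPERATORS_PER_STATEMENT

DATA_MEMORY_UNITS = 16
IF_TREE_PROCESSORS = 1

# Reading assumed here: total processors equal to the number of Data
# Memory Units, which is already a power of two.
total_processors = DATA_MEMORY_UNITS
alu_processors = total_processors - IF_TREE_PROCESSORS
```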

7.8.3 I/O Processors

For most programs we would like to be able to assign an I/O
Processor to each file in use in a phase. For our sample of 42 programs
we found that the average number of files used per phase was 3.4, with
a maximum of nine. Since the larger programs tended to use more files,
about six I/O Processors are needed. To get a power of two for
addressing purposes, we select eight as the number of I/O Processors.

7.8.4 SORT Processor

None of the programs we examined had more than one file being
sorted at one time. Thus, one sorting network should be sufficient.

7.9  Instruction  Dispatch  Unit  Memory  Sizes 

There  are  two  memories  in  the  Instruction  Dispatch  Unit  whose 
sizes  we  must  find.  These  are  the  Tag  Status  Register  Array  and  the 
Instruction  Waiting  Register  Array.  The  following  analysis  yields  these 
sizes: 

Let 1 unit time = 1 memory cycle time
                = 1 add time

Assume:

1)  n = number of Address Counters

2)  m = number of memory units

3)  p = number of Arithmetic/Logical Processors + number of IF Tree
    Processors

4)  Each Address Counter issues r instructions per unit time at
    highest speed.

5)  Each Address Counter is fetching instructions for a fraction
    $f_f \le 1$ of the time.

6)  The instruction stream contains the following fractions of each
    instruction type:

       $f_A$   Arithmetic
       $f_M$   Memory-only
       $f_C$   Address Counter Control
       $f_B$   Branch
       $f_I$   Conditional Branch

    $$f_A + f_M + f_C + f_B + f_I = 1 \qquad (7.9-1)$$

7)  The number of operands for instructions reaching the Instruction
    Dispatch Unit are:

       3   Arithmetic
       2   Memory
       1   Address Counter Control

8)  Average holding times for tags are:

       Arithmetic instructions
          Sources   2
          Results   3
       Memory-only instructions
          Sources   1
          Results   2

9)  Average holding times in the Instruction Dispatch Unit for
    instructions are:

       Arithmetic   2 (1 fetch, 1 previous operation)
       Memory       1 (1 previous operation)

10) Consider a condition in which all Address Counters are active,
    with conditional and unconditional branch instructions being
    encountered, but no interlocks inhibiting any of the Address
    Counters.

Considering the machine to be represented by the model in
Figure 7.3, we define:

$\lambda_f$ = rate at which instructions are being fetched from
              Program Memory by each Address Counter
            = $f_f \times$ (maximum fetch rate)

$\lambda_i$ = rate at which instructions are being transferred from
              each Address Counter to the Instruction Dispatch Unit

$\lambda_I$ = rate at which instructions reach the Instruction
              Dispatch Unit
            = rate at which instructions leave the Instruction
              Dispatch Unit

$\lambda_C$ = rate at which Address Counter Coordinator instructions
              leave the Instruction Dispatch Unit

$\lambda_P$ = rate at which processor instructions leave the
              Instruction Dispatch Unit

$\lambda_p$ = rate at which processor instructions reach each
              processor

$\lambda_M$ = rate at which memory-only operations leave the
              Instruction Dispatch Unit

$\lambda_m$ = rate at which memory-only operations reach each
              Data Memory Unit
Since we are assuming that interlock instructions are not
encountered, we neglect the contribution of $\lambda_C$.

Since each processor is capable of executing an operation each
cycle,

$$\lambda_P \le p \text{ operations per cycle.}$$

Since the Data Memory Units can together handle about twice as many
memory-only operations each cycle (in addition to processor
operations) as there are memories,

$$\lambda_M \le 2m \text{ operations per cycle.}$$

Thus,

$$\lambda_I \le 2m + p \text{ operations per cycle.} \qquad (7.9-2)$$

Comparing this result with the discussions of sections 7.1, 7.6, and
7.8, we see that

   $\lambda_I = 20$
   m = 16
   p = 16

which does satisfy relation 7.9-2, since 20 is less than
2(16) + 16 = 48.

Since we are assuming that interlocks do not interfere with
Address Counter functioning, the only things that do interfere are
branches, both conditional and unconditional. If we assume that a
conditional branch takes one unit of time to be resolved, then a
fraction $1 - f_I - f_B$ of the instructions fetched are passed on to
the Instruction Dispatch Unit. Thus,

$$\lambda_i = (1 - f_I - f_B)\,\lambda_f = (f_A + f_M + f_C)\,\lambda_f \approx (f_A + f_M)\,\lambda_f$$

where $f_C$ is taken to be negligible. Since the instruction streams
coming from each Address Counter are independent (by our assumption of
no interlock activity), the rate at which instructions reach the
Instruction Dispatch Unit is n times the rate at which instructions
are issued by an Address Counter.

$$\lambda_I = n\lambda_i = n\lambda_f (f_A + f_M) \qquad (7.9-3)$$

The rate at which instructions leave the Instruction Dispatch Unit is
equal to the rate at which they arrive. Clearly, this must be true or
infinite queues would be needed. From relation 7.9-3 the arrival rate
for tag requests is

$$\lambda_T = n\lambda_f (3f_A + 2f_M)$$

using assumption (7).


The number of tags in use at any point in time is

$$N_T = \sum_{\text{types of requests}} \; \sum_{i=k}^{0} (\text{number of arrivals at } t_i)$$

where $t_k$ is the smallest average time such that there are no tags
remaining at time zero from those that arrived $t_k$ units of time
earlier. From assumption (8) we have

$$N_T = \frac{n(3 + 3 + 1)(3 f_A \lambda_f)}{3} + \frac{n(2 + 1)(2 f_M \lambda_f)}{2}$$

$$= 7 f_A \lambda_f n + 3 f_M \lambda_f n$$

$$= n\lambda_f (7f_A + 3f_M) \qquad (7.9-4)$$

In a similar manner we find that the number of instructions
waiting, per unit time, for their operands to become available is

$$N_I = n\lambda_f (f_A + f_M) \qquad (7.9-5)$$

We have previously calculated the following values:

   n = 16 Address Counters
   $f_A \approx 0.4$
   $f_M \approx 0.5$
   $\lambda_I = 20$ operations per cycle

From equation 7.9-3 we find

$$\lambda_f = \frac{\lambda_I}{n(f_A + f_M)} = \frac{20}{16(0.9)} = 1.4 \text{ operations per cycle per Address Counter.}$$

From equation 7.9-4 we then have

$$N_T = 16(1.4)(7 \times 0.4 + 3 \times 0.5) = 96.3 \text{ tags in use.}$$

From equation 7.9-5 we find

$$N_I = 16(1.4)(0.9) = 20.2 \text{ instructions waiting.}$$
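
The numerical evaluation above can be retraced with the Python sketch below. The variable names are ours. The fetch rate is rounded to one decimal place before use, which is what the printed results suggest was done.

```python
n = 16            # Address Counters
f_A = 0.4         # fraction of arithmetic instructions
f_M = 0.5         # fraction of memory-only instructions
lambda_I = 20.0   # instructions per cycle through the Instruction Dispatch Unit

# Equation 7.9-3 solved for the fetch rate of one Address Counter.
lambda_f_exact = lambda_I / (n * (f_A + f_M))
lambda_f = round(lambda_f_exact, 1)          # rounded as in the text

N_T = n * lambda_f * (7 * f_A + 3 * f_M)     # equation 7.9-4
N_I = n * lambda_f * (f_A + f_M)             # equation 7.9-5

# The same quantity without the intermediate rounding, for comparison.
N_T_exact = n * lambda_f_exact * (7 * f_A + 3 * f_M)
```

The rounding changes the tag estimate by less than one register, so it has no effect on the sizes chosen for the two arrays.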
Clearly, 128 Tag Status Registers and 32 Instruction Waiting
Registers are adequate. The total sizes of these two memories are found
as follows:

1) For each Tag Status Register we need three content addressable
   address fields, each of which is comprised of 15 bits
   ($\log_2$ (32K)). In addition we need a 128 word x 48 bit content
   addressable memory for the Tag Status Register Array.

2) Each Instruction Waiting Register is comprised of the following
   fields:

      a) Instruction Operation Code       (8 bits)
      b) Operand and result addresses     (3 x 15 bits)
      c) Operand and result lengths       (3 x 6 bits)
      d) Operand and result tags          (3 x 5 bits)
      e) Status bits                      (2 bits)

   This gives a total of 88 bits of which the 15 tag storage bits
   must be content addressable. Thus, the Instruction Waiting
   Registers can be constructed of 32 words x 73 bits of random
   access memory and 32 words x 15 bits of content addressable
   memory.
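
The field widths and register counts can be checked with the following Python sketch. The dictionary keys and the helper `next_power_of_two` are hypothetical. The text does not say that the register counts were obtained by rounding up to a power of two, although the chosen values of 128 and 32 coincide with that rule.

```python
FIELDS = {
    "operation code": 8,
    "operand and result addresses": 3 * 15,
    "operand and result lengths": 3 * 6,
    "operand and result tags": 3 * 5,
    "status bits": 2,
}

register_bits = sum(FIELDS.values())
cam_bits = FIELDS["operand and result tags"]   # content addressable part
ram_bits = register_bits - cam_bits            # random access part


def next_power_of_two(x: float) -> int:
    """Smallest power of two that is greater than or equal to x (x > 0)."""
    size = 1
    while size < x:
        size *= 2
    return size


tag_registers = next_power_of_two(96.3)
waiting_registers = next_power_of_two(20.2)
```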

7.10  Other  Devices 

Within  each  unit  of  the  machine  there  are  a  number  of  devices 
which  we  have  not  yet  discussed.  To  allow  us  to  estimate  in  section  7.11 
the  number  of  packages  needed  to  build  the  machine,  we  briefly  describe, 
in  this  section,  the  circuitry  for  each  unit.  Not  included  in  this 
section,  however,  are  the  control  circuitry  for  each  unit  and  the  bus 
drivers  and  receivers  needed  for  the  various  control  signals  sent 
between  units.  These  units  are  included  in  the  calculations  of 
section  7.11 . 

7.10.1  Program  Memory 

In addition to the memory itself we need the following devices
in the Program Memory:

Queue
   - 16 words x 16 bits to hold instruction addresses (12 bits) and
     Address Counter numbers (4 bits) in the Fetch Queuing and
     Routing Unit.

Decoder
   - 4 bits to 1-of-16 for routing instructions to the proper
     Address Counter.

Bus Drivers and Receivers
   - 68 drivers for sending instructions to Address Counters.
   - 12 receivers for instruction addresses.

7.10.2  Address  Counter 


Each Address Counter requires the following devices:

Incrementer
   - One 12 bit counter.

Registers
   - One 12 bit Program Address Register.
   - One 68 bit Memory Buffer.
   - Eight 15 bit index registers.

Decoder
   - 3 bits to 1-of-6 for the Op Code Decoder.

Matcher
   - 4 bit matcher for the Address Counter ID Match Unit.

Adder
   - Three 15 bit adders for the Address Calculation Unit.

Bus Drivers and Receivers
   - 12 drivers for sending the program address to the Program
     Memory.
   - 71 drivers for sending the instruction to the Instruction
     Dispatch Unit.
   - 15 drivers for transferring index registers to the Address
     Counter Coordinator.
   - 68 receivers for instructions from the Program Memory.
   - 12 receivers for the initiation point address from the Address
     Counter Coordinator.
   - 15 receivers for index register values from the Address Counter
     Coordinator.
   - 19 receivers for the Index Bus.

7.10.3  Instruction  Dispatch  Unit 

In addition to the Instruction Waiting Registers and the Tag
Status Register Array, the following devices are needed in the
Instruction Dispatch Unit:

Queues
   - 16 words x 71 bits for the Arriving Instruction Queue.
   - 48 words x 5 bits for the Tag Queue.
   - 8 words x 76 bits for the Processor Instruction Queue.

Registers
   - One 16 bit register for processor status information.
   - Five 71 bit registers used for pipelining the operation of the
     Fetch and Tag Generator.

Bus Drivers and Receivers
   - 76 drivers for the Memory Operation Bus.
   - 71 drivers for sending instructions to the Address Counter
     Coordinator.
   - 76 drivers for the Processor Operation Bus.
   - 71 receivers for instructions from the Address Counters.
   - 5 receivers for the Tag Bus.

7.10.4  Address  Counter  Coordinator 

A detailed design for this unit was not presented in
Chapter 5; thus it is difficult to specify precisely the necessary
devices for the unit. If we assume an implementation completely in
hardware, however, it is apparent that devices of the following types
would be needed:

Queues
   - 8 words x ~32 bits for the FORK queue.
   - 2 words x ~32 bits for the HOLD queue.

Registers
   - Sixteen 16 bit registers for interlock status information.
   - Three 16 bit registers for Address Counter status information.
   - Sixteen 8 bit registers for predecessor/successor information.
   - Nine 15 bit registers for saving index registers for a HOLDing
     Address Counter.
   - Fifteen 4 bit registers for the list of interlocks whose
     RELEASE is being awaited.

Bus Drivers and Receivers
   - 12 drivers for sending an initiation point address to an
     Address Counter.
   - 15 drivers for sending index register values.
   - 15 receivers for receiving index register values from an
     Address Counter.
   - 71 receivers for instructions from the Instruction Dispatch
     Unit.

7.10.5  Data  Memory  Unit 

The Data Memory Unit includes the following devices:

Queue
   - 2 words x 76 bits for the Memory Operation Bus.

Incrementer
   - A 10 BCD digit counter.

Match Circuits
   - Ten 6 bit matchers in the Function Logic for EXAMINE and
     TRANSFORM operations.

Register
   - One 6 bit register associated with the match circuits.

Bus Drivers and Receivers
   - 19 drivers for the Index Bus.
   - 60 drivers for the Inter-Memory Bus.
   - 60 drivers to the Routing Network.
   - 60 drivers for the I/O Bus.
   - 5 drivers for the Tag Bus.
   - 60 receivers for the Inter-Memory Bus.
   - 60 receivers from the Routing Network.
   - 60 receivers for the I/O Bus.
   - 76 receivers for the Memory Operation Bus.

7.10.6  Routing  Network 

The  Routing  Network  can  be  implemented  as  a  16  x  16  crossbar 
switch.  For  60  lines  per  path  through  the  network  we  then  need  a 
total  of  15,360  crosspoints  in  the  network. 
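
The crosspoint count is the product of the two port counts and the path width, as the Python sketch below shows. The names of the two sides of the switch are assumptions, since the text gives only the 16 x 16 dimensions.

```python
MEMORY_SIDE_PORTS = 16       # assumed to be the Data Memory Units
PROCESSOR_SIDE_PORTS = 16    # assumed to be the processors
LINES_PER_PATH = 60          # ten 6-bit characters per path

crosspoints = MEMORY_SIDE_PORTS * PROCESSOR_SIDE_PORTS * LINES_PER_PATH
```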

7.10.7  Arithmetic/Logical  Unit 

Each Arithmetic/Logical Unit has the capability of operating
on a pair of 10 digit BCD numbers. In addition, logic is included to
interpret two low order digits as an exponent to simulate floating point
operations. This requires the following devices for each of these units:

Adder
   - 10 BCD digits.

Multiplier
   - 10 BCD digits.

Logical Unit
   - Ten 6 bit characters.

Shift Register
   - 8 BCD digits.

Bus Drivers and Receivers
   - 60 drivers to send to the Routing Network.
   - 60 receivers to receive from the Routing Network.
   - 76 receivers to receive from the Instruction Dispatch Unit.

7.10.8  IF  Tree  Processor 

The  design  of  the  IF  Tree  Processor  has  been  detailed  by  Davis 
[DAV72b:  page  32]  and  is  not  repeated  here. 

7.10.9  1/0  Processor 

No  design  for  an  I/O  Processor  is  given  in  Chapter  5.  In 
order  to  arrive  at  an  estimate  of  the  hardware  required  for  each  of 
these  units,  we  first  examine  the  throughput  required  for  a  unit. 

From  Table  6.6  we  find  that  the  average  time  spent  processing 
a  record  for  phases  able  to  use  p  Address  Counters  is  on  the  order  of 
10  units  of  time.  During  this  time  we  can  have  16  records  being  pro- 
cessed concurrently.  If  we  assume  that  each  READ  statement  results  in 
the  transfer  to  the  Data  Memory  of  the  equivalent  of  an  entire  card
image  of  useful  data,  then  we  must  transfer  480  bits  for  each  READ.  If
we  assume  that  a  unit  of  time  is  about  1  microsecond,  we  have

\[
R = \frac{(480\ \text{bits/READ})(16\ \text{READs})}{10\ \mu\text{sec}} = 768\ \text{M bits/sec}
\]

for  the  rate  at  which  an  I/O  Processor  must  supply  data  to  the  Data
Memory.  Transmitting  all  data  in  parallel,  the  I/O  Processor  must  exe- 
cute 1.6  M  transmissions  per  second.  With  all  eight  I/O  Processors 
active  (a  rare  occurrence),  the  I/O  Bus  must  handle  12.8  M  transmissions 
per  second.  If  three  quarters  of  each  record  read  is  discarded  by  the 
I/O  Processor,  then  each  file  must  supply  data  to  its  I/O  Processor  at 
four  times  the  rate  the  I/O  Processor  supplies  data  to  the  Data  Memory. 
The  data  rate  required  from  a  file  is  then  6.4  M  transmissions  per 
second  or  3072  M  bits  per  second.  We  can  achieve  this  transfer  rate  if 
we  can  obtain  large  head-per-track  disks  capable  of  reading  five  words 
(300  tracks)  in  parallel  at  a  rate  of  10  M  bits  per  second.  Such  disks 
are  not  yet  available,  but  bulk  storage  devices  which  have  the  necessary 
bandwidth,  notably  those  built  of  semiconductor  devices,  are  becoming 
feasible. 
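
The arithmetic in this estimate can be retraced with a short sketch. The function name `io_rates` and its parameter names are illustrative assumptions and do not come from the thesis; the numeric defaults are the figures quoted above.

```python
def io_rates(bits_per_read=480, concurrent_reads=16, record_time_us=10,
             io_processors=8, discarded_fraction=0.75,
             tracks=300, track_rate_mbits=10):
    """Retrace the I/O Processor throughput estimate (rates in millions per second)."""
    # 16 READs every 10 microseconds -> transmissions per microsecond (= M/sec).
    reads_per_us = concurrent_reads / record_time_us
    # Bits delivered to the Data Memory per microsecond (= M bits/sec).
    memory_mbits = bits_per_read * concurrent_reads / record_time_us
    # If 3/4 of every record is discarded, the file must deliver 4x as much.
    file_factor = 1 / (1 - discarded_fraction)
    return {
        "processor_mtrans": reads_per_us,
        "memory_mbits": memory_mbits,
        "bus_mtrans": reads_per_us * io_processors,
        "file_mtrans": reads_per_us * file_factor,
        "file_mbits": memory_mbits * file_factor,
        # Aggregate rate of a disk reading all tracks in parallel.
        "disk_mbits": tracks * track_rate_mbits,
    }
```

Under these figures the hypothetical 300-track disk delivers roughly the 3072 M bits per second that a file must supply, which is why the text treats such a device as adequate.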

If  we  can  obtain  large  head-per-track  disks  which  can  supply 
data  at  high  rates,  then  the  amount  of  buffer  storage  needed  in  the  I/O 
Processor  can  be  quite  small.  We  will  assume  for  the  purposes  of  this 
section  that  the  I/O  Processor  is  composed  of  a  register  for  the  cur- 
rent five  words  of  the  current  record,  a  register  for  data  to  be 
transferred  to  the  Data  Memory,  a  memory  containing  record  format 
information,  and  a  network  to  route  ten  characters  at  a  time  to  one  of
the  words  in  the  buffer  register.  On  this  basis  it  appears  that  the
following  devices  are  needed  for  each  I/O  Processor:

Memory

-  256  60  bit  words  of  format  memory.

Registers

-  One  300  bit  record  buffer.

-  One  960  bit  data  buffer.

Crosspoints

-  A  50  x  50  array  of  crosspoints  where  each 
crosspoint  carries  six  bits  in  parallel. 

Bus  Drivers  and  Receivers 

-  960  drivers  for  the  I/O  Bus  to  supply  all 
16  Data  Memory  Units  simultaneously. 

-  300  drivers  to  send  data  to  a  file. 

-  960  receivers  for  the  I/O  Bus. 

-  300  receivers  to  receive  data  from  a  file. 

7.10.10  SORT  Network 

As  in  the  case  of  the  I/O  Processor,  no  design  has  been  pro- 
posed for  the  SORT  Network.  Consequently,  only  a  rough  estimate  of  the 
required  hardware  is  given  here.  By  supplying  the  SORT  Network  with 
data  from  the  Data  Memory  Units,  we  can  sort  16  words  at  a  time.  From 
Batcher's  results  [BAT68]  the  number  of  comparators  needed  to  sort  2^p
numbers  is

\[
N = (p^2 + p)\,2^{p-2} = (16 + 4)(4) = 80
\]

for  our  case  (p = 4),  arranged  in  10  levels  with  8  comparators  in  each  level.
Thus  we  need  eighty  60  bit  comparators  in  the  network.  We  also  need 
drivers  and  receivers  for  sixteen  60  bit  words. 
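
As a check on the comparator count, the following sketch evaluates the formula quoted above for any p. The function name is an illustrative assumption; the level and per-level expressions are the standard figures for a bitonic sorting network, which is the Batcher network whose comparator count matches this formula.

```python
def batcher_comparators(p):
    """Comparator count, depth, and width of a network sorting 2**p items."""
    # N = (p^2 + p) * 2^(p-2), written with integer arithmetic so p = 1 works too.
    total = (p * p + p) * 2 ** p // 4
    levels = p * (p + 1) // 2        # number of comparator stages
    per_level = 2 ** (p - 1)         # comparators working in parallel per stage
    assert levels * per_level == total
    return total, levels, per_level
```

For 16 words (p = 4) this gives 80 comparators in 10 levels of 8, the figures used in the estimate.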

7.11  Package  Counts 

In  this  section  we  present  an  estimate  of  the  number  of  pack- 
ages of  circuitry  needed  to  build  this  machine.  The  packages  we  use 
for  these  counts  are  the  Dual  In-line  Packages  available  from  a  number 
of  manufacturers.  For  those  cases  for  which  appropriate  devices  exist, 
we  have  used  them  for  these  counts.  For  those  cases  for  which  we  found 
no  existing  device,  we  estimated  the  number  of  packages  over  which  it 
would  be  necessary  to  split  the  device.  Further  comments  are  included 
in  the  discussion  of  each  type  of  device. 

7.11.1  Memories 

For  the  purposes  of  counting  memory  packages  we  have  different 
package  sizes.  For  slow  bulk  memory  we  assume  a  package  contains  up  to 
a  4096  x  1  bit  array.  For  fast  register  memory  we  assume  a  1024  x  1  bit 
array  per  package.  For  Content  Addressable  Memory  we  assume  a  16  x  1  bit 
array  per  chip.  For  a  small  memory,  such  as  the  Data  Register  set  in  a 
Data  Memory  Unit,  we  use  a  64  x  4  bit  array.  For  individual  registers 
we  use  a  4  bit  register  chip.  Table  7.3  gives  a  summary  of  the  memory 
package  requirement.  Totaling  the  number  of  packages,  we  have  the 
following  requirements: 

Slow  Memory  1028  packages 

Fast  Memory  3927  packages 

Content  Addressable  Memory     1120  packages

Table 7.3
Memory Package Requirements

Unit                      Memory Size       Memory    Pkgs per      #        #
                                            Type      Unit        Units    Pkgs

Program Memory            4096 x 68         Slow         68          1       68
                          1024 x 68         Fast         68          1       68

Data Memory               2048 x 60         Slow         60         16      960
                          64 x 60           Fast         15         16      240
                          64 x 11           CAM          44         16      704
                          6 x 1             Fast          2         16       32

Address Counter           12 x 1            Fast          3         16       48
                          68 x 1            Fast         17         16      272
                          15 x 8            Fast          2         16       32

Instruction Dispatch      128 x 48          CAM         384                 384
Unit                      73 x 32           Fast         16                  16
                          15 x 32           CAM          32                  32
                          16 x 1            Fast          4                   4
                          71 x 5            Fast          4                   4

Address Counter           16 x 16           Fast          4                   4
Coordinator               16 x 3            Fast          1                   1
                          8 x 16            Fast          4                   4
                          15 x 9            Fast          2                   2
                          4 x 15            Fast          4                   4

Arithmetic/Logical        8 digit x 1       Fast          8         15      120
Unit                      (shift reg.)

IF Tree Processor         255 x 1           Fast         64          1       64
                          12 x 1            Fast          3          1        3
                          34 x 1            Fast          9          1        9

I/O Processor             256 x 60          Fast         60          8      480
                          1 x 300           Fast         75          8      600
                          1 x 960           Fast        240          8     1920

If  we  can  obtain  16  bit  register  chips  for  the  I/O  Processor  registers,
then  the  number  of  packages  needed  for  fast  memory  would  be  reduced  to 
2037  packages. 
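
The memory totals can be reproduced from the rows of Table 7.3. In the sketch below the row tuples are transcribed from that table; the names `FAST_PKGS`, `SLOW_PKGS`, `CAM_PKGS`, and `IO_REGISTER_PKGS` are illustrative, and the assumption that 16 bit register chips cut the I/O register package count to exactly one quarter is inferred from the 2037 figure rather than stated in the text.

```python
# (packages per unit, number of units) for each row of Table 7.3, by memory type.
SLOW_PKGS = [(68, 1), (60, 16)]
CAM_PKGS = [(44, 16), (384, 1), (32, 1)]
FAST_PKGS = [
    (68, 1),                                  # Program Memory cache
    (15, 16), (2, 16),                        # Data Memory
    (3, 16), (17, 16), (2, 16),               # Address Counters
    (16, 1), (4, 1), (4, 1),                  # Instruction Dispatch Unit
    (4, 1), (1, 1), (4, 1), (2, 1), (4, 1),   # Address Counter Coordinator
    (8, 15),                                  # Arithmetic/Logical Units
    (64, 1), (3, 1), (9, 1),                  # IF Tree Processor
    (60, 8), (75, 8), (240, 8),               # I/O Processors
]
# The 300 bit and 960 bit I/O Processor registers, built from 4 bit chips.
IO_REGISTER_PKGS = [(75, 8), (240, 8)]


def total(rows):
    """Sum packages-per-unit times number of units over a list of rows."""
    return sum(per_unit * units for per_unit, units in rows)


def fast_total_with_16_bit_registers():
    """Fast-memory total if the I/O registers used chips four times as wide."""
    io_regs = total(IO_REGISTER_PKGS)
    return total(FAST_PKGS) - io_regs + io_regs // 4
```

The I/O Processor registers account for 2520 of the 3927 fast-memory packages, which is why wider register chips have such a large effect on the total.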

7.11.2  Queues 

In  order  to  allow  various  units  in  the  machine  to  operate 
asynchronously  from  one  another,  we  have  included  a  number  of  queues. 
For  the  purposes  of  Table  7.4,  we  have  assumed  that  a  single  package 
contains  four  bits  of  storage  plus  logic  to  control  the  first-in/ 
first-out  operation  of  the  queue.  From  Table  7.4  it  can  be  seen  that 
we  need  1072  of  these  packages. 

7.11.3  Bus  Drivers  and  Receivers 

Because  we  have  a  number  of  buses  connecting  units  in  the 
machine,  we  are  able  to  have  a  number  of  items  of  data  and  a  number  of 
instructions  in  transit  simultaneously.  For  this  same  reason  we  require 
a  number  of  bus  drivers  and  receivers.  We  assume  that  we  can  have  four 
driver/receivers  per  package  for  the  purposes  of  Table  7.5.  From  this
table  we  find  that  we  require  5349  packages. 

7.11.4  Other  Devices 

A  number  of  other  devices  are  needed  to  complete  the  machine. 
These  are  summarized  in  Table  7.6.  The  number  of  packages  needed  for 
these  other  devices  is  7380. 

Table 7.4
Queue Package Requirements

Unit                      Queue Size    # Pkgs/Unit    # Units    # Pkgs

Program Memory            16 x 16            64                       64

Instruction Dispatch      16 x 71            72                       72
Unit                      48 x 5             96                       96
                          8 x 76            152                      152

Address Counter           8 x 32             64                       64
Coordinator               2 x 32             16                       16

Data Memory Unit          2 x 76             38           16         608

Table 7.5
Bus Driver and Receiver Package Requirements

                                  Per Unit
Unit                      # Drivers    # Receivers    # Units    Total Packages

Program Memory                 68            12           1             20

Address Counter                98           114          16            800

Address Counter                27            86           1             25
Coordinator

Instruction Dispatch          223            76           1             75
Unit

Data Memory Unit              185           196          16           1120

Arithmetic/Logical             60           136          15            510
Unit

I/O Processor                1260          1260           8           2520

SORT Network                  960           960           1            240

IF Tree Processor              19            36           1             39

7.11.5  Total  Package  Requirement

Adding  together  the  numbers  of  packages  computed  in  this
section,  we  find  that  we  need  19,876  packages  for  the  devices  we  have
discussed.  This  does  not,  however,  include  control  logic  or  signal  paths
between  units.  To  allow  for  this  we  assume  that  this  logic  requires  half
again  as  many  packages  as  we  have  counted.  Thus  we  arrive  at  a  total
package  count  of  about  30,000  packages  for  this  machine.  Table  7.7
summarizes  the  counts  generated  in  this  section.  The  parenthesized
values  include  control  logic,  while  the  other  values  are  the  ones
generated  in  this  section.
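
The total can be retraced from the four category counts given in sections 7.11.1 through 7.11.4. This is a minimal sketch; the dictionary keys and the function name are illustrative assumptions.

```python
# Package counts from sections 7.11.1 to 7.11.4.
COUNTED = {"memory": 6075, "queue": 1072, "driver": 5349, "other": 7380}


def package_totals(counted):
    """Return (counted packages, packages including control logic)."""
    base = sum(counted.values())
    # "Half again as many" means multiplying by 3/2.
    with_control = base * 3 // 2
    return base, with_control
```

Scaling the grand total gives 29,814 packages; Table 7.7 lists 29,815 because each column there is scaled and rounded separately. Either way the result is about 30,000.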
Table 7.7
Total Package Requirements

Unit                 Memory      Queue       Driver      Other       Total
                     Packages    Packages    Packages    Packages    Packages

Program Memory          136          64          20           4         224
                       (204)        (96)        (30)         (6)       (336)

Data Memory            1936         608        1120         320        3984
                      (2904)       (912)      (1680)       (480)      (5976)

Address                 352                     800         224        1376
Counters               (528)                  (1200)       (336)      (2064)

Inst. Disp.             440         320          75                     835
Unit                   (660)       (480)       (113)                  (1253)

Addr. Cntr.              15          80          25                     120
Coordinator             (23)       (120)        (38)                   (180)

Arith./Logical          120                     510        2250        2880
Units                  (180)                   (765)      (3375)      (4320)

IF Tree                  76                      39         150         265
Processor              (114)                    (59)       (225)       (398)

I/O Processors         3000                    2520        1872        7392
                      (4500)                  (3780)      (2808)     (11088)

SORT Processor                                  240        1600        1840
                                               (360)      (2400)      (2760)

Routing Network                                             960         960
                                                          (1440)      (1440)

Total Packages         6075        1072        5349        7380       19876
                      (9113)      (1608)      (8024)     (11070)     (29815)



8.   PROBLEM  PROGRAMS 

8.1  Characteristics  of  Problem  Programs 

In  examining  the  results  of  the  analyses  of  a  number  of  COBOL 
programs,  we  find  that  phases  of  programs  fall  into  three  classes.  The 
first  class  consists  of  those  phases  which  are  parallel  processable. 
These  phases  yield  the  best  results  for  our  method  of  concurrent  pro- 
cessing. The  second  class  of  programs  consists  of  those  which  contain 
sequential  constraints  but  which  do  allow  a  number  of  Address  Counters 
to  concurrently  process  data.  While  the  speedup  of  these  phases  is  not 
as  spectacular  as  in  the  first  class,  our  method  does  yield  an  apprecia- 
ble improvement  over  single  Address  Counter  execution.  The  third  class, 
which  we  examine  in  this  chapter,  consists  of  phases  which  allow  very
little  concurrent  processing  and  consequently  very  little  speedup.

Some  of  the  phases  are  in  this  last  class  because  they  contain 
a  large  number  of  Reference  Dependent  variables  and  the  resulting  inter- 
lock constraints.  There  is  little  we  can  do  for  these  phases  without 
completely  redesigning  them.  Some  phases,  however,  are  encumbered  only 
by  the  presence  of  a  few  Reference  Dependent  variables.  It  is  possible 
in  some  of  these  phases  to  improve  their  speedup  by  making  a  change  in 
one  of  the  restrictions  imposed  in  section  2.2. 

8.2  Speeding  Up  Problem  Programs 

The  change  in  our  method  needed  to  accomplish  this  improvement
is  to  allow  each  Address  Counter,  when  necessary,  the  ability  to
examine  the  contents  of  its  immediate  predecessor's  Local  variable  set.
This  ability  could,  of  course,  be  extended  to  allow  access  to  Local 
variables  between  Address  Counters  at  any  fixed  separation.  As  an  exam- 
ple of  the  usefulness  of  this  ability,  consider  the  following  program 
fragment: 

1  LOOP. 

2  READ  IN-FILE  AT  END  GO  TO  DONE. 

3  IF  IN-SEQ  >  OLD-SEQ 

4  THEN  MOVE  CORRESPONDING  IN-DATA  TO  OUT-DATA 

5  WRITE  OUT-DATA 

6  ELSE  DISPLAY  IN-SEQ  ' SEQ  ERROR'. 

7  MOVE  IN-SEQ  TO  OLD-SEQ. 

8  GO  TO  LOOP. 

Under  the  constraint  that  each  Address  Counter  has  access  only  to  its 
own  Local  variable  set,  this  loop  is  completely  sequential.  The  problem 
is  that  OLD-SEQ  cannot  be  accessed  by  Address  Counter  i  until  Address 
Counter  i-1  has  set  its  proper  value.  Relaxing  this  constraint,  let  us 
substitute  IN-SEQ_{i-1}  for  OLD-SEQ,  and  use  IN-SEQ_i  where  the  program
indicates  IN-SEQ.  Now,  by  allowing  Address  Counter  i  to  access
IN-SEQ_{i-1},  we  can  process  the  input  records  concurrently.
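
The effect of this substitution can be illustrated with a short sketch. The Python below is only an analogy for the COBOL fragment: the function names, the list `in_seqs` of sequence numbers, and the explicit `initial_old_seq` argument (the fragment does not show how OLD-SEQ is initialized) are assumptions made for the illustration.

```python
def check_sequential(in_seqs, initial_old_seq):
    """Statements 3 to 7 executed one record at a time, carrying OLD-SEQ along."""
    old_seq = initial_old_seq
    results = []
    for in_seq in in_seqs:
        results.append("WRITE" if in_seq > old_seq else "SEQ ERROR")
        old_seq = in_seq                      # MOVE IN-SEQ TO OLD-SEQ
    return results


def check_record(in_seqs, i, initial_old_seq):
    """Decision for record i, using only its own and its predecessor's IN-SEQ."""
    previous = in_seqs[i - 1] if i > 0 else initial_old_seq
    return "WRITE" if in_seqs[i] > previous else "SEQ ERROR"


def check_concurrent(in_seqs, initial_old_seq):
    """Each record's decision is independent, so they could be made in any order."""
    return [check_record(in_seqs, i, initial_old_seq)
            for i in range(len(in_seqs))]
```

Because `check_record` never reads a value written by another record's processing, nothing forces record i to wait for record i-1 to finish; it only needs the predecessor's input to have been read.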

In  order  to  allow  an  Address  Counter  to  access  its  predeces- 
sor's Local  variable  set,  the  following  conditions  must  exist: 
1)  It  must  be  possible  to  replace  the  Reference 
Dependent  variable  causing  the  problem  (OLD-SEQ 
in  the  example)  with  the  same  Local  variable  in 
all  cases.

2)  The  phase  being  transformed  by  replacement  must
contain  only  one  primary  READ  statement. 

3)  Every  path  within  the  phase  must  contain  an  assign- 
ment of  the  value  of  the  Local  variable  to  the  vari- 
able we  want  to  remove. 

4)  There  is  no  assignment  statement  in  the  phase  for 
which  the  Local  variable  is  an  output  variable. 

These  conditions  can  be  identified  as  follows: 

1)  Examine  all  occurrences  of  a  Reference  Dependent 
variable  as  an  output  variable.  If  it  is  always 
assigned  a  value  from  the  same  Local  variable,  the 
Reference  Dependent  variable  is  a  candidate  for 
removal.

2)  The  presence  of  only  one  primary  READ  statement  can 
be  detected  during  the  source  text  scan  (section  3.1).

3)  Each  path  in  the  phase  must  be  traced,  starting  at 
the  primary  READ.  If  the  primary  READ  is  again 
encountered,  this  attempt  at  replacement  must  be 
abandoned.  If  a  statement  causing  the  assignment 
of  the  value  of  the  Local  variable  to  the  Reference 
Dependent  variable  is  reached,  the  path  satisfies 
condition  (3)  above. 

4)  By  finding  that  the  set  of  output  references  for  the 
Local  variable  is  empty,  we  know  that  no  assignment 
is  made  to  the  Local  variable,  satisfying 
condition  (4)  above. 
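
The four tests above can be sketched over a small program graph. This is not the compiler's algorithm from Chapter 3: the `Stmt` record, the three statement kinds, and the function `replacement_for` are hypothetical names, the caller is assumed to know that the candidate source variable is Local, and a path that leaves the phase (such as AT END) is assumed to be acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    kind: str                       # "READ", "ASSIGN", or "OTHER"
    inputs: frozenset = frozenset()
    outputs: frozenset = frozenset()
    succ: tuple = ()                # indices of successor statements


def replacement_for(stmts, rd_var):
    """Return the Local variable that may replace rd_var, or None."""
    setters = {i for i, s in enumerate(stmts) if rd_var in s.outputs}
    # Test 1: rd_var is only ever assigned, and always from one variable.
    sources = {stmts[i].inputs for i in setters}
    if len(sources) != 1 or any(stmts[i].kind != "ASSIGN" for i in setters):
        return None
    (source,) = sources
    if len(source) != 1:
        return None
    (local,) = source
    # Test 2: exactly one primary READ in the phase.
    reads = [i for i, s in enumerate(stmts) if s.kind == "READ"]
    if len(reads) != 1:
        return None
    # Test 4: no assignment statement has the Local variable as output.
    if any(s.kind == "ASSIGN" and local in s.outputs for s in stmts):
        return None
    # Test 3: trace every path from the READ; a path stops at an assignment
    # to rd_var and fails if it comes back to the READ first.
    read, seen, stack = reads[0], set(), list(stmts[reads[0]].succ)
    while stack:
        n = stack.pop()
        if n == read:
            return None
        if n in setters or n in seen:
            continue
        seen.add(n)
        stack.extend(stmts[n].succ)
    return local


F = frozenset
# The sequence-checking loop of section 8.2, one entry per statement 2 to 8.
LOOP = [
    Stmt("READ", outputs=F({"IN-SEQ", "IN-DATA"}), succ=(1,)),
    Stmt("OTHER", inputs=F({"IN-SEQ", "OLD-SEQ"}), succ=(2, 4)),
    Stmt("ASSIGN", inputs=F({"IN-DATA"}), outputs=F({"OUT-DATA"}), succ=(3,)),
    Stmt("OTHER", inputs=F({"OUT-DATA"}), succ=(5,)),
    Stmt("OTHER", inputs=F({"IN-SEQ"}), succ=(5,)),
    Stmt("ASSIGN", inputs=F({"IN-SEQ"}), outputs=F({"OLD-SEQ"}), succ=(6,)),
    Stmt("OTHER", succ=(0,)),
]
```

For the loop of section 8.2, `replacement_for(LOOP, "OLD-SEQ")` reports IN-SEQ as the replacement. If the error path were changed to skip the MOVE, the trace would reach the READ again and the replacement would be rejected.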

If  the  Reference  Dependent  variable  can  be  removed,  we  can
delete  the  statements  in  which  the  variable  to  be  removed  occurs  as  an 
output  variable.  In  each  statement  in  which  the  removed  variable 
appears  as  an  input  variable,  we  substitute  the  preceding  Address 
Counter's  copy  of  the  Local  variable.  To  give  an  Address  Counter  access 
to  its  predecessor's  copy  of  the  Local  variable,  we  use  a  different 
index  register  for  the  base  address  of  the  predecessor's  storage  than 
for  the  Address  Counter's  own  storage.  This  index  register's  value  is 
set  when  an  Address  Counter  is  activated.  It  is  also  necessary  to  mark 
Address  Counters  used  in  this  way  so  that  an  Address  Counter's  storage 
is  not  released  until  both  that  Address  Counter  and  its  successor  are 
ready  to  return  to  the  pool  of  available  Address  Counters. 

Unfortunately,  allowing  an  Address  Counter  access  to  its  pre- 
decessor's Local  variable  set,  while  conceptually  simple,  causes  an 
increase  in  the  complexity  of  the  machine,  a  larger  number  of  algorithms 
in  the  compiler,  and  a  more  complicated  operating  system.  Since  the 
conditions  under  which  variable  replacement  can  be  done  are  seldom 
satisfied  in  a  real  program,  it  appears  that  the  cost  of  this  improve- 
ment outweighs  the  benefits  derived  from  it. 


9.   SOFTWARE  DESIGN

While  examining  many  COBOL  programs  we  found  various  language 
features  and  programming  practices  which  hindered  concurrent  record  pro- 
cessing. Conversely,  some  features  and  techniques  were  found  which 
aided  our  method  of  speedup.  The  language  features  and  programming 
techniques  which  help  and  which  hinder  our  method  are  discussed  in  this 
chapter. 

9.1  Language  Features  Which  Hinder 

9.1.1  ALTER 

A  COBOL  feature  [IBM72]  which  makes  our  method  impossible  to 
apply  is  the  ALTER  command.  This  instruction  causes  the  program  code  to 
be  modified  during  execution  by  overwriting  the  destination  of  a  branch 
instruction.  For  example,  consider  the  code: 

A.    READ  FILE-1.

B.    GO  TO  P1.

P1.   ALTER  B  TO  PROCEED  TO  P2.

      GO  TO  A.

P2.   ADD  1  TO  FILE-1-IN.

      READ  FILE-1.

The  program  analysis  needed  for  concurrent  record  processing
is  dependent  on  a  priori  knowledge  of  the  program  graph.  Because  the
ALTER  command  allows  any  branch  in  the  program  to  be  changed  during
execution,  the  program  as  analyzed  by  the  compiler  and  the  program  as
executed  may  have  entirely  different  program  graphs.  Our  algorithms
would  form  the  graph  in  Figure  9.1(a)  after  insertion  of  FORKs,  HOLDs,
and  QUITs.  After  the  block  labeled  P1  is  executed,  however,  the
program  graph  would  look  like  the  one  in  Figure  9.1(b),  which  will  give
erroneous  results.

If  a  dialect  of  COBOL  is  implemented  for  our  machine,  the 
ALTER  command  should  be  dropped  from  that  dialect.  The  same  function 
can  be  performed  by  setting  and  testing  in-line  switches,  as  in  the 
following  code: 

77    FIRST-CARD-SW    PICTURE  9  VALUE  1. 


A.    READ  FILE-1.

B.    IF  FIRST-CARD-SW  =  1,  GO  TO  P1

      ELSE  GO  TO  P2.

P2.   ADD  1  TO  FILE-1-IN.

      READ  FILE-1.

This  is  properly  handled  by  the  algorithms  presented  in  Chapter  3.

9.1.2  Subroutines 

Subroutine  calls  are  seldom  used  in  COBOL  since  the  PERFORM
command  provides  a  very  powerful  alternative.  Those  calls  which  do
appear,  however,  cause  a  problem.  Since  we  lack  information  about  the
way  variables  are  used  in  a  user-supplied  subroutine,  we  must  assume
the  worst.  We  must  assume  that  all  variables  in  the  argument  list  are
used  as  both  input  and  output  variables.  Further,  we  must  assume  that
all  arguments  are  Reference  Dependent  variables.  For  system-supplied 
subroutines  we  may  be  able  to  relax  these  restrictions  since  we  should 
be  able  to  determine  which  variables  are  not  Reference  Dependent  and 
which  variables  are  not  in  the  input  or  the  output  variable  set  of  the 
CALL  statement.  In  the  dialect  of  COBOL  implemented  for  our  machine, 
the  programmer  should  be  able  to  supply  information  which  informs  the 
compiler  of  the  characteristics  of  the  variables  used  in  subroutine 
calls. 

9.2  Language  Features  Which  Help 

9.2.1  Complex  Operations 

There  are  a  few  statements  in  COBOL,  such  as  TRANSFORM  and 
EXAMINE,  which  call  for  complex  operations.  In  FORTRAN  these  operations 
would  require  a  subroutine  call  or  appreciably  more  than  one  statement, 
rather  than  the  single  COBOL  statement.  These  statements  are  used 
frequently  enough  to  justify  building  hardware  units  to  execute  them. 
Since  a  hardware  unit  can  execute  one  of  these  instructions  in  a  matter 
of  a  few  clock  cycles  rather  than  a  number  of  instruction  cycles,  they 
contribute  to  speeding  up  the  execution  of  a  program. 

9.2.2  SORT 

The  SORT  command  is  another  which  can  contribute  to  program 
speedup  by  being  implemented  in  hardware.  The  networks  described  by 
Batcher  sort  a  large  number  of  records  in  a  relatively  small  number  of 
cycles  compared  to  doing  the  same  operation  in  a  software  algorithm. 




9.3  Programming  Techniques  Which  Hinder 

9.3.1  Sequence  Checking 

Many  COBOL  programs  would  run  very  slowly  on  a  multiple
Address  Counter  machine  because  of  techniques  which  use  variables  which 
must  be  Reference  Dependent.  The  sequence  checking  done  in  the  example 
in  section  8.2  is  an  example  of  such  a  technique.  In  many  cases  there 
is  an  alternative  to  sequence  checking.  This  alternative  is  sorting  the 
input  file  on  the  sequence  number  before  using  the  file.  On  a  conventional
monoprocessor  this  would  be  a  very  time-consuming  alternative.
On  a  multiple  Address  Counter  machine  with  sorting  hardware,  the  oppo- 
site would  be  true. 

9.4  Programming  Techniques  Which  Help 

9.4.1  Super-records 

Some  programs  would  run  relatively  slowly  on  a  multiple 
Address  Counter  machine  simply  because  there  are  READ  statements  on  the 
primary  file  scattered  throughout  the  program.  If  these  READs  could  be 
combined  into  a  single  READ,  the  program  would  execute  more  rapidly 
because  the  decision  of  whether  or  not  that  READ  would  be  executed  again 
could  be  made  sooner  in  the  instruction  stream.  One  way  of  accomplish- 
ing this  is  through  the  use  of  what  we  have  come  to  refer  to  as  a 
"super-record."  An  example  should  make  this  concept  clear. 

Consider  a  program  which  uses  a  "finder"  file  to  update  rec- 
ords in  a  master  file.  Quite  possibly  there  is  more  than  one  update 
record  for  a  given  master  record.  If  all  of  the  updates  for  a  master
record  are  combined  to  form  a  super-record,  then  one  READ  statement
handles  all  of  the  data  for  updating  one  master  record.  Instead  of 
sequencing  through  the  individual  update  records,  requiring  many  inter- 
locks to  keep  instruction  streams  from  interfering,  a  number  of  update 
super-records,  requiring  fewer  interlocks,  could  be  concurrently 
applied  to  a  number  of  master  records.  This  could  yield  an  appreciable 
speed  improvement. 
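
A minimal sketch of forming super-records follows. It assumes the update file is already ordered by master key and models each update as a (key, amount) pair; these representations and the function names are illustrative and not taken from the thesis.

```python
from itertools import groupby
from operator import itemgetter


def build_super_records(updates):
    """Combine consecutive updates that share a master key into one super-record.

    `updates` is an iterable of (master_key, update) pairs sorted by key.
    """
    return [(key, [update for _, update in group])
            for key, group in groupby(updates, key=itemgetter(0))]


def apply_super_record(balance, amounts):
    """Apply every update for one master record in a single step."""
    return balance + sum(amounts)
```

With one super-record per master record, each master record is touched by exactly one READ, so different master records can be updated by different Address Counters without interlocking on the individual update records.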

9.4.2  Parallel  Tasking 

Occasionally  two  phases  of  a  COBOL  program  are  completely 
independent.  Because  of  the  rarity  of  this  occurrence  and  because  of 
the  cost  of  searching  for  this  occurrence,  no  such  search  is  included 
among  the  algorithms  proposed  in  Chapter  3.  However,  in  a  machine 
capable  of  concurrently  executing  more  than  one  instruction  stream,  it 
seems  a  shame  not  to  provide  a  programmer  with  the  necessary  instruc- 
tions for  getting  two  or  more  phases  into  concurrent  execution.  The 
same  three  instructions  described  in  section  3.6  should  be  sufficient. 
In  COBOL-like  syntax,  they  might  look  like  the  following  in  the  dialect
implemented  on  our  multiple  Address  Counter  machine: 

FORK  TO  procedure-name. 

HOLD. 

QUIT. 
Since  these  instructions  would  generate  the  corresponding  machine 
instructions,  no  special  hardware  or  compiler  algorithms  are  needed  to 
handle  them.  The  only  restriction  on  their  use  would  be  that  all  links 
from  the  concurrently  executed  phases  would  have  to  terminate  on  the 
same  HOLD  instruction. 
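
As a loose analogy only, the three instructions behave much like a fork and join over threads. The sketch below is not the machine's mechanism (section 3.6, which defines FORK, HOLD, and QUIT, is outside this excerpt); the function name and the mapping of each instruction onto a threading call are assumptions.

```python
import threading


def run_phases_concurrently(phases):
    """Run independent phases concurrently and wait for all of them."""
    results = [None] * len(phases)

    def worker(index, phase):
        # Returning from the worker plays the role of QUIT.
        results[index] = phase()

    threads = [threading.Thread(target=worker, args=(i, phase))
               for i, phase in enumerate(phases)]
    for thread in threads:
        thread.start()              # analogous to FORK TO procedure-name
    for thread in threads:
        thread.join()               # analogous to a single shared HOLD
    return results
```

The single loop of `join` calls mirrors the stated restriction that all concurrently executed phases terminate on the same HOLD.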


10.  CONCLUSIONS

10.1  Summary  of  Results 

In  this  thesis  we  have  tried  to  demonstrate  the  following: 

1)  The  technique  of  concurrently  processing  a  number 
of  records  has  the  potential  for  greatly  speeding 
up  the  execution  of  a  business  data  processing 
program. 

2)  Compiler  algorithms  can  be  formulated  to  insert 
the  necessary  FORK,  HOLD,  and  QUIT  commands  to 
start  and  stop  the  concurrent  processing  of 
instruction  streams. 

3)  Compiler  algorithms  can  be  formulated  to  insert 
the  necessary  TEST  and  RELEASE  commands  to  protect 
those  variables  we  call  Reference  Dependent 
variables. 

4)  Hardware,  rather  than  compiler  routines,  can  be 
designed  to  protect  those  variables  we  call 
Reference  Independent  variables  and  to  prevent 
out-of-sequence  references  to  all  variables. 

5)  A  machine  can  be  designed  to  implement  the 
technique  of  concurrent  record  processing. 

From  the  results  of  an  analysis  of  a  number  of  COBOL  programs, 
we  have  been  able  to  determine  values  for  a  number  of  machine  parameters.
These  parameters  and  their  values  are  the  following:

1)  Program  Memory

4096  words  (RAM)*
68  bits  per  word

2)  Program  Memory  -  Cache

1K  words  (RAM)
68  bits  per  word

3)  Address  Counters 

16  units 

4)  Instruction  Dispatch  Unit  -  Tag  Status  Register  Array 

128  words  (CAM)* 
48  bits  per  word 

5)  Instruction  Dispatch  Unit  -  Instruction  Waiting 
Register  Array 

32  words 

73  bits  per  word  (RAM) 

15  bits  per  word  (CAM) 

6)  Data  Memory 

16  units 

2K  words  (RAM) 

10  characters  per  word 

6  bits  per  character 

7)  Data  Memory  -  Data  Registers

64  words  (RAM)

8)  Data  Memory  -  Address  Registers

64  words  (CAM)
11  bits  per  word

9)  Processors

15  Arithmetic/Logical  Processors
1  IF  Tree  Processor
8  I/O  Processors
1  SORT  Processor

*  RAM  -  Random  Access  Memory
   CAM  -  Content  Addressable  (Associative)  Memory

10.2  Areas  for  Further  Inquiry 

As  might  be  expected,  the  research  described  in  this  paper
has  raised  a  number  of  questions  which  could  bear  further  research. 
Questions  which  have  occurred  to  us  include  the  following: 

1)  How  can  the  limit  imposed  on  our  machine  by  the 
Instruction  Dispatch  Unit  (section  7.1)  be  avoided? 
The  design  of  the  Instruction  Dispatch  Unit  given 
in  section  5.4  does  perform  the  function  for  which 
it  was  designed  --  protecting  sequential  access  to  a
variable  where  this  protection  is  needed  (section  2.2)
--  but  the  speed  we  feel  is  attainable  falls  short  of
the  speed  we  feel  is  desirable.  If  it  is  not  possi- 
ble to  make  a  large  improvement  in  the  speed  of  this 
unit,  say  by  an  order  of  magnitude,  then  some  other 
way  of  implementing  this  function  should  be  found. 

2)  What  better  compiler  algorithms  should  be  designed 
for  this  machine?  In  Chapter  3  we  presented  a  set
of  compiler  algorithms  to  demonstrate  the  feasibility
of  our  technique.  Due  to  our  training  and  experience 
in  programming  single  instruction  stream  machines, 
these  algorithms  are  designed  to  use  only  one  Address 
Counter.  For  a  machine  capable  of  concurrently  exe- 
cuting several  instruction  streams,  there  must  be 
better  algorithms  and  better  approaches  to  compiler 
design  than  simply  using  the  single  instruction 
stream  techniques  developed  for  past  machines. 

3)  What  techniques,  beyond  those  discussed  in 
section  9.4,  will  allow  a  machine  such  as  this  one 
to  perform  as  well  as  it  has  the  potential  to 
perform?  In  writing  programs  for  a  machine  with  a 
different  structure,  it  is  apparent  that  some  new 
techniques  will  have  to  be  learned  and  some  old  ones 
unlearned. 

4)  How  does  the  availability  of  a  number  of  Address 
Counters  affect  the  design  of  an  operating  system 
for  this  machine?  It  should  make  some  design 
aspects  easier  than  for  a  single  instruction  stream 
machine.  For  example,  instead  of  enqueuing  some 
requests  for  system  resources,  it  may  be  possible 
to  execute  a  FORK  instruction  to  provide  these 
resources.  It  could  also  make  some  design  aspects 
more  challenging,  such  as  trying  to  recover  information
about  the  cause  of  a  program  failure
(e.g.,  from  which  instruction  stream  did  an  attempt
to  divide  by  zero  originate?).

10.3  Final  Comment 

Despite  the  questions  and  problems  the  technique  of  concurrent 
record  processing  raises,  there  appears  to  be  a  very  real  potential  for 
program  speedup  using  this  technique.


APPENDIX

Program  Analysis  Example

To  demonstrate  the  algorithms  given  in  Chapter  3,  we  present
in  this  Appendix  a  simplified  example  of  the  analysis  of  a  COBOL  program.
The  program  used  for  this  analysis,  given  in  Table  A.1,  is  a
very  simple  program  constructed  with  many  features  commonly  found  in  the 
real  programs  we  analyzed.  In  essence,  this  program  selects  records 
from  an  input  file,  builds  an  output  file  by  excerpting  data  from  the 
selected  input  records,  and  sorts  this  output  file.  The  program  then 
matches  the  sorted  records  with  records  from  a  master  file  and  prints  a 
report  from  the  data  included  in  the  two  sets  of  records. 

A.1  Source  Text  Scan

Two  tables  are  presented  which  contain  the  results  of  the  scan
of  the  source  text.  Table  A.2  gives  the  correspondence  between  the
variable  names  used  in  the  program  and  the  variable  numbers  we  use  in
various  tables.  Note  that  files  are  treated  as  variables  and  that
FILLER  items  have  been  dropped.  Table  A.3  presents  the  following
information  for  each  statement: 

STMT  #    -  The  sequence  number  we  have  assigned  to  the  statement.

STATEMENT  -  An  abbreviated  copy  of  the  statement.

NODE  TYPE  -  The  type  of  node  representing  this  statement  in  the
             program  graph.  The  following  abbreviations  are  used:

Table A.1
Program Listing

000001 IDENTIFICATION DIVISION.
000002 PROGRAM-ID.   SAMPLE.
000003 ENVIRONMENT DIVISION.
000004 INPUT-OUTPUT SECTION.
000005 FILE-CONTROL.
000006     SELECT DEPT-MAST ASSIGN TO UT-S-DPTMAST.
000007     SELECT SDT-MAST ASSIGN TO UT-S-SDTMAST.
000008     SELECT PRT-FILE ASSIGN TO UT-S-PRTFILE.
000009     SELECT SORT-FILE ASSIGN TO UT-S-SRTFILE.
000010 DATA DIVISION.
000011 FILE SECTION.
000012 FD  DEPT-MAST
000013     RECORDING MODE IS F
000014     BLOCK CONTAINS 1240 CHARACTERS
000015     RECORD CONTAINS 124 CHARACTERS
000016     LABEL RECORDS ARE STANDARD
000017     DATA RECORD IS DEPT-REC.
000018 01  DEPT-REC.
000019     02  NAME        PIC X(30).
000020     02  SSN         PIC 9(9).
000021     02  FILLER      PIC X(82).
000022     02  CODE-1      PIC XX.
000023     02  CODE-2      PIC X.
000024 FD  SDT-MAST
000025     RECORDING MODE IS F
000026     BLOCK CONTAINS 800 CHARACTERS
000027     RECORD CONTAINS 80 CHARACTERS
000028     LABEL RECORDS ARE STANDARD
000029     DATA RECORD IS MAST-REC.
000030 01  MAST-REC.
000031     02  SSNO        PIC 9(9).
000032     02  FILLER      PIC X(64).
000033     02  UNITS       PIC 99V9.
000034     02  GPA         PIC 9V999.
000035 FD  PRT-FILE
000036     RECORDING MODE IS F
000037     BLOCK CONTAINS 1300 CHARACTERS
000038     RECORD CONTAINS 130 CHARACTERS
000039     LABEL RECORDS ARE STANDARD
000040     DATA RECORD IS PRT-BUF.
000041 01  PRT-BUF         PIC X(130).
000042 SD  SORT-FILE
000043     RECORDING MODE IS F
000044     RECORD CONTAINS 40 CHARACTERS
000045     DATA RECORD IS SORT-REC.
000046 01  SORT-REC.

000047     02  SSN         PIC 9(9).
000048     02  NAME        PIC X(30).
000049     02  CODE-2      PIC X.
000050 WORKING-STORAGE SECTION.
000051 77  DEPT-CNT        PIC 9(5)  VALUE 0.
000052 77  DEPT-SELECTED   PIC 9(5)  VALUE 0.
000053 77  SORT-CNT        PIC 9(5)  VALUE 0.
000054 77  MAST-CNT        PIC 9(5)  VALUE 0.
000055 77  PAGES           PIC 9(5)  VALUE 0.
000056 77  LINES           PIC 9(2)  VALUE 52.
000057 01  PRINT-REC.
000058     02  SSN         PIC 9(9).
000059     02  FILLER      PIC XX    VALUE SPACES.
000060     02  NAME        PIC X(30).
000061     02  FILLER      PIC X(5)  VALUE SPACES.
000062     02  CODE-2      PIC X.
000063     02  FILLER      PIC XXX   VALUE SPACES.
000064     02  UNITS       PIC 99.9.
000065     02  FILLER      PIC X(4)  VALUE '    '.
000066     02  GPA         PIC 9.999.
000067     02  FILLER      PIC X(68) VALUE SPACES.
000068 01  PRT-HEAD.
000069     02  FILLER      PIC X(50) VALUE SPACES.
000070     02  FILLER      PIC X(20) VALUE ' MATCH REPORT'.
000071     02  FILLER      PIC X(40) VALUE SPACES.
000072     02  PAGENO      PIC 9(5).
000073     02  FILLER      PIC X(35) VALUE SPACES.
000074 PROCEDURE DIVISION.
000075     SORT SORT-FILE ASCENDING SSN OF SORT-REC
000076         INPUT PROCEDURE SELECT-DEPT
000077         OUTPUT PROCEDURE PRINT-REPORT.
000078     GOBACK.
000079 SELECT-DEPT SECTION.
000080     OPEN INPUT DEPT-MAST.
000081 READ-DEPT.
000082     READ DEPT-MAST AT END GO TO END-SELECT.
000083     ADD 1 TO DEPT-CNT.
000084     IF CODE-1 = '26' GO TO READ-DEPT-DONE.
000085     IF CODE-1 = '37' OR CODE-2 OF DEPT-REC = 'M',
000086         MOVE '*' TO CODE-2 OF DEPT-REC,
000087         GO TO READ-DEPT-DONE.
000088     GO TO READ-DEPT.
000089 READ-DEPT-DONE.
000090     MOVE CORRESPONDING DEPT-REC TO SORT-REC.
000091     RELEASE SORT-REC.
000092     ADD 1 TO DEPT-SELECTED.
000093     GO TO READ-DEPT.
000094 END-SELECT.
000095     CLOSE DEPT-MAST.
000096 PRINT-REPORT SECTION.
000097     OPEN INPUT SDT-MAST, OUTPUT PRT-FILE.
000098 RETURN-DEPT.
000099     RETURN SORT-FILE AT END GO TO END-JOB.
000100     ADD 1 TO SORT-CNT.
000101 READ-SDT.
000102     READ SDT-MAST AT END
000103         DISPLAY ' RAN OUT OF MASTER RECORDS ' SORT-REC,
000104         GO TO END-JOB.
000105     ADD 1 TO MAST-CNT.
000106 MATCH-RECS.
000107     IF SSN OF SORT-REC > SSNO, GO TO READ-SDT.
000108     IF SSN OF SORT-REC < SSNO,
000109         DISPLAY ' NO MASTER RECORD FOR STUDENT ' SORT-REC,
000110         PERFORM RETURN-DEPT,
000111         GO TO MATCH-RECS.
000112     MOVE CORRESPONDING SORT-REC TO PRINT-REC.
000113     MOVE CORRESPONDING MAST-REC TO PRINT-REC.
000114     IF LINES > 50,
000115         ADD 1 TO PAGES,
000116         MOVE PAGES TO PAGENO,
000117         WRITE PRT-BUF FROM PRT-HEAD,
000118         MOVE 0 TO LINES.
000119     WRITE PRT-BUF FROM PRINT-REC.
000120     ADD 1 TO LINES.
000121     GO TO RETURN-DEPT.
000122 END-JOB.
000123     CLOSE SDT-MAST, PRT-FILE.
000124     DISPLAY DEPT-CNT, DEPT-SELECTED, SORT-CNT, MAST-CNT.
Table A.2
Variable Names

No.   Name

  1   DEPT-MAST
  2   DEPT-REC.NAME
  3           .SSN
  4           .CODE-1
  5           .CODE-2
  6   SDT-MAST
  7   MAST-REC.SSNO
  8           .UNITS
  9           .GPA
 10   PRT-FILE
 11   PRT-BUF
 12   SORT-FILE
 13   SORT-REC.SSN
 14           .NAME
 15           .CODE-2
 16   DEPT-CNT
 17   DEPT-SELECTED
 18   SORT-CNT
 19   MAST-CNT
 20   PAGES
 21   LINES
 22   PRINT-REC.SSN
 23            .NAME
 24            .CODE-2
 25            .UNITS
 26            .GPA
 27   PRT-HEAD.PAGENO
 28   SYSOUT
            BAS   - Block of assignment statements.
            IF    - IF statement.
            ON    - ON or AT condition attached to the preceding
                    statement.
            READ  - Input statement.
            SYST  - Routine other than I/O which should be provided
                    by the operating system.
            WRITE - Output statement.
I VAR     - The set of input variables for the statement. Items
            prefaced by an equal sign (=) are constants having the
            appropriate value. For example, ='M' is a constant whose
            value is the character M.
O VAR     - The set of output variables for the statement.
PRED      - The numbers of the statements which immediately precede
            the statement on some path through the program.
SUC       - The numbers of the statements which immediately follow
            the statement on some path through the program.

These last two entries provide the program graph.
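As a minimal sketch of how such rows might be held in memory, the Python fragment below stores one record per statement and derives the PRED entries from the SUC entries. The rows shown are hypothetical: they were inferred from the SELECT-DEPT section of the listing, and the class name, field names, statement numbering, and node types (in particular WRITE for RELEASE) are assumptions. Predecessors that lie outside the fragment are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    stmt: int          # STMT #
    statement: str     # abbreviated statement text
    node_type: str     # BAS, IF, ON, READ, SYST or WRITE
    i_var: set         # input variables; constants carry a leading '='
    o_var: set         # output variables
    suc: list          # numbers of the immediate successors
    pred: list = field(default_factory=list)


# Hypothetical rows for the record-selection loop (numbering assumed).
ROWS = [
    Row(2, "READ DEPT-MAST", "READ", {"DEPT-MAST"}, {"DEPT-REC"}, [3]),
    Row(3, "AT END", "ON", set(), set(), [4, 11]),
    Row(4, "ADD 1 TO DEPT-CNT", "BAS", {"DEPT-CNT", "=1"}, {"DEPT-CNT"}, [5]),
    Row(5, "IF CODE-1 = '26'", "IF", {"CODE-1", "='26'"}, set(), [6, 8]),
    Row(6, "IF CODE-1 = '37' OR CODE-2 = 'M'", "IF",
        {"CODE-1", "CODE-2", "='37'", "='M'"}, set(), [2, 7]),
    Row(7, "MOVE '*' TO CODE-2", "BAS", {"='*'"}, {"CODE-2"}, [8]),
    Row(8, "MOVE CORR DEPT-REC TO SORT-REC", "BAS",
        {"DEPT-REC"}, {"SORT-REC"}, [9]),
    Row(9, "RELEASE SORT-REC", "WRITE", {"SORT-REC"}, {"SORT-FILE"}, [10]),
    Row(10, "ADD 1 TO DEPT-SELECTED", "BAS",
        {"DEPT-SELECTED", "=1"}, {"DEPT-SELECTED"}, [2]),
]


def fill_pred(rows):
    """Derive each row's PRED list by inverting the SUC lists."""
    by_stmt = {row.stmt: row for row in rows}
    for row in rows:
        row.pred = []
    for row in rows:
        for successor in row.suc:
            if successor in by_stmt:      # skip nodes outside the fragment
                by_stmt[successor].pred.append(row.stmt)
    for row in rows:
        row.pred.sort()
    return by_stmt
```

Since PRED is just the inverse of SUC, only one of the two needs to be recorded during the scan; the other can be filled in afterwards.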

A.2  Phase and Link Identification

Figure A.1 is a sketch of the program graph. It is apparent
that there are two phases in this program. One consists of nodes 2
through 10, and the other consists of nodes 14 through 35 with the
exclusion of node 19. Nodes 1, 11, 12, 13, 19, 36, and 37 are in
various links in the program. The statement numbers shown in parentheses
at some nodes are the comparison operations needed for the corresponding
IF statement; (29), for example, is the comparison (of LINES to the
value 50) whose result is used in executing the conditional branch at
node 29.

Figure A.1
Program Graph
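The split into phases and links can be reproduced mechanically under one simplifying assumption: a phase is taken to be a maximal set of nodes that lie on a common cycle, and every remaining node belongs to a link. This is only a sketch and may not coincide with the formal definition used in the body of the thesis. The successor lists below are not copied from Table A.3; they were inferred from the program listing, and the statements assigned to nodes 1, 12, 13, 36, and 37 in particular are guesses.

```python
# Successor lists inferred from the listing (an assumption, see above).
SUC = {
    1: [2], 2: [3], 3: [4, 11], 4: [5], 5: [6, 8], 6: [7, 2], 7: [8],
    8: [9], 9: [10], 10: [2], 11: [12], 12: [13], 13: [14], 14: [15],
    15: [16, 36], 16: [17], 17: [18], 18: [19, 20], 19: [36], 20: [21],
    21: [17, 22], 22: [23, 27], 23: [24], 24: [25], 25: [26, 36],
    26: [21], 27: [28], 28: [29], 29: [30, 34], 30: [31], 31: [32],
    32: [33], 33: [34], 34: [35], 35: [14], 36: [37], 37: [],
}


def reachable(suc, start):
    """Nodes reachable from start along one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in suc.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def find_phases(suc):
    """Return (phases, links): cyclic regions and the nodes outside them."""
    reach = {node: reachable(suc, node) for node in suc}
    phases, placed = [], set()
    for node in sorted(suc):
        if node in placed or node not in reach[node]:
            continue                      # already grouped, or not on a cycle
        group = sorted(m for m in reach[node] if node in reach[m])
        phases.append(group)
        placed.update(group)
    links = sorted(n for n in suc if n not in placed)
    return phases, links
```

With these assumed edges, `find_phases(SUC)` yields one phase covering nodes 2 through 10, a second covering nodes 14 through 35 without node 19, and the link nodes 1, 11, 12, 13, 19, 36, and 37.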

A.3  Statement Migration

Very little statement migration is possible in this program.
Figure A.2 shows the movement of statements 8 and 10 in Phase 1.
Figure A.3 shows the movement of statements 33 and 35 in Phase 2. In
both figures, only the portion of the phase affected is shown.

A.4  Variable Type Identification

Table A.4 shows the types of each of the variables used in the
program. The sets of nodes in which the variables appear as input and
as output variables are also shown. The abbreviations used for the
variable types are the following:

C  - Constant
L  - Local
RI - Reference Independent
RD - Reference Dependent
NU - Not used in this phase
Table A.4
Variable Types

                  Phase 1                     Phase 2
Variable   I-set   O-set   Type    I-set            O-set    Type

    1      2       -       RD      -                -        NU
    2      8       2       L       -                -        NU
    3      8       2       L       -                -        NU
    4      5,6,8   2       L       -                -        NU
    5      6,8     2,7     L       -                -        NU
    6      -       -       NU      17               -        RD
    7      -       -       NU      21,22            17       L
    8      -       -       NU      28               17       L
    9      -       -       NU      28               17       L
   10      -       -       NU      -                32,34    RD
   11      -       -       NU      -                32,34    L
   12      -       9       RD      14,24            -        RD
   13      9       8       L       19,21,22,23,27   14,24    L
   14      9       8       L       19,23,27         14,24    L
   15      9       8       L       19,23,27         14,24    L
   16      4       4       RI      -                -        NU
   17      10      10      RI      -                -        NU
   18      -       -       NU      16,26            16,26    RI
   19      -       -       NU      20               20       RI
   20      -       -       NU      30,31            30       RI
   21      -       -       NU      29,35            33,35    RD
   22      -       -       NU      34               27
   23      -       -       NU      34               27
   24      -       -       NU      34               27
   25      -       -       NU      34               28
   26      -       -       NU      34               28
   27      -       -       NU      32               31       RD
   28      -       -       NU      -                19,23    RD
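The I-set and O-set columns are the inverse of the per-statement input and output variable sets: for each variable, they collect the nodes that read it and the nodes that write it. The sketch below illustrates that inversion for Phase 1 only. The per-node sets are assumptions, chosen by reading the listing so that they reproduce the Phase 1 columns of Table A.4; variables are identified by their Table A.2 numbers, and the rules that assign the types L, RI, and RD are not modelled.

```python
from collections import defaultdict

# node -> (input variable numbers, output variable numbers); assumed values.
PHASE1_USE = {
    2: ({1}, {2, 3, 4, 5}),            # READ DEPT-MAST
    4: ({16}, {16}),                   # ADD 1 TO DEPT-CNT
    5: ({4}, set()),                   # IF CODE-1 = '26'
    6: ({4, 5}, set()),                # IF CODE-1 = '37' OR CODE-2 = 'M'
    7: (set(), {5}),                   # MOVE '*' TO CODE-2
    8: ({2, 3, 4, 5}, {13, 14, 15}),   # MOVE CORRESPONDING
    9: ({13, 14, 15}, {12}),           # RELEASE SORT-REC
    10: ({17}, {17}),                  # ADD 1 TO DEPT-SELECTED
}


def io_sets(use):
    """Invert node -> variables into variable -> set of nodes."""
    i_set, o_set = defaultdict(set), defaultdict(set)
    for node, (inputs, outputs) in use.items():
        for var in inputs:
            i_set[var].add(node)
        for var in outputs:
            o_set[var].add(node)
    return dict(i_set), dict(o_set)
```

For example, variable 5 (CODE-2 of DEPT-REC) comes out with I-set {6, 8} and O-set {2, 7}, matching its Phase 1 row.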
A.5  Storage Assignment

Table A.5 shows the D, A, Q, and S sets, defined in
section 3.5, for each variable for which storage is to be assigned.
Note that the files and the buffer named PRT-BUF have been dropped.

Table A.6 shows our storage unit assignment. This assignment
is shown for a memory word size of 10 characters. The
utilization of the memory is given as a percentage of the number of
characters available in the memories needed for this program.
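The utilization percentages in Table A.6 appear consistent with a simple calculation: the characters available equal the number of memories times the words per memory times the word size, and utilization is the ratio of characters used to characters available. The sketch below states that arithmetic. The function names are hypothetical, and the idea that a 30-character item such as NAME occupies three 10-character words is an inference from the a, b, c suffixes in the table.

```python
import math


def words_needed(chars, word_size=10):
    """Number of words a variable of the given length occupies."""
    return math.ceil(chars / word_size)


def utilization(used_chars, memories, words_per_memory, word_size=10):
    """Return (characters available, percent used to one decimal place)."""
    available = memories * words_per_memory * word_size
    return available, round(100.0 * used_chars / available, 1)
```

Under these assumptions, `utilization(82, 6, 2)` gives 120 characters available and 68.3 percent used, which agrees with the Phase 1 local figures in the table.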

A.6  Positioning FORK, HOLD, and QUIT Instructions

Figure A.4 shows the program after we have inserted the FORK,
HOLD, and QUIT instructions needed in each phase.

A.7  Inserting Interlocks

Figure A.4 also shows the program after we have inserted the
necessary TEST and RELEASE instructions for each phase.
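For readers more familiar with software threads, the following Python sketch offers a loose analogy to these instructions. It is not the mechanism of the proposed machine: starting a thread merely stands in for FORK, a thread running to completion for QUIT, and acquiring and releasing a lock for TEST and RELEASE. HOLD and any ordering guarantees the interlocks provide are not modelled. The function name, the simplified selection test, and the division of records among streams are all assumptions.

```python
import threading


def run_streams(code1_values, n_streams=4):
    """Count selected records using several concurrent streams."""
    interlock = threading.Lock()       # guards the shared counter
    shared = {"dept_selected": 0}

    def stream(chunk):
        for code1 in chunk:
            selected = code1 in ("26", "37")    # local work, no interlock
            if selected:
                with interlock:                 # TEST ... RELEASE analogue
                    shared["dept_selected"] += 1
        # returning from the function plays the role of QUIT

    threads = [
        threading.Thread(target=stream, args=(code1_values[i::n_streams],))
        for i in range(n_streams)
    ]
    for thread in threads:
        thread.start()                 # FORK analogue
    for thread in threads:
        thread.join()                  # wait until every stream has quit
    return shared["dept_selected"]
```

The point of the analogy is only that work on a record's own fields can proceed without coordination, while an update to a common variable such as DEPT-SELECTED is bracketed by the interlock.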
Table A.6
Storage Unit Assignment

Phase 1:

Word size = 10 characters

Memory Units   1  2  3  4  5  6
(Local)        2a  2b  2c  3  4  5  14a  14b  14c  13  15
(Global)       16  17

# Variables                    =   9
# Memories                     =   6
# Local words                  =   2
# Global words                 =   1
# Local characters available   = 120
# Local characters used        =  82  (68.3%)
# Global characters available  =  60
# Global characters used       =  10  (16.7%)

Phase 2:

Word size = 10 characters

Memory Units   1  2  3  4  5  6
(Local)        7  8  13  14a  14b  14c  9  15  26  25  22  23a  23b
               23c  24
(Global)       l°7  21  19

# Variables                    =  15
# Memories                     =   6
# Local words                  =   2
# Global words                 =   1
# Local characters available   = 120
# Local characters used        = 102  (85.0%)
# Global characters available  =  60
# Global characters used       =  17  (28.3%)
Figure A.4
Final Program Graph

[Figure not reproduced. Labels legible in the graph: TEST SORT-FILE,
RELEASE SORT-FILE, TEST PRT-FILE, RELEASE PRT-FILE, RELEASE PAGENO,
RELEASE LINES, and QUIT, together with node numbers 12, 13, and 30
through 37.]